[openib-general] mvapich2 ofed 1.2 problem
Steve Wise
swise at opengridcomputing.com
Thu Feb 15 08:59:45 PST 2007
Shaun,
Lemme know if you have an mvapich2 kit that I can test with iwarp...
Thanks,
Steve.
On Wed, 2007-02-14 at 23:31 -0500, Shaun Rowland wrote:
> Roland Dreier wrote:
> > > When I build using the OFED-1.2-20070208-1508, libibverbs 1.0 is what is
> > > built, at least by looking at the .so file result:
> > >
> > > [rowland at z0 ~]$ ls /usr/local/ofed/lib64/ |grep ibverbs
> > > libibverbs.a
> > > libibverbs.so
> > > libibverbs.so.1
> > > libibverbs.so.1.0.0
> >
> > The soname hasn't changed because the library is still compatible.
> > But (I hope at least) OFED has libibverbs 1.1.
>
> The soname is libibverbs.so.1, so I guess the longer name would not
> matter anyway. Clearly, what I posted shows the IBVERBS 1.1 ABI is
> there. I think I have figured out why our code has this problem. The
> problem below is similar to the original one posted about.
>
> I did some experimentation with the srq_pingpong libibverbs example
> code. First I built it directly with:
>
>
> gcc -g -c pingpong.c -I/usr/local/ofed/include
>
> gcc -g -c -D_GNU_SOURCE srq_pingpong.c -I/usr/local/ofed/include
>
> gcc -g -o srq_pingpong srq_pingpong.o pingpong.o -L/usr/local/ofed/lib64 -libverbs
>
>
> This works. Next I copied srq_pingpong.c to two files:
>
> srq_pingpong_rowland.c
> - just has a main function that calls lib_start().
>
> srq_pingpong_lib_rowland.c
> - main() changed to lib_start().
>
> This moves all of the SRQ pingpong code into a shared library. If I
> build this shared library in this way, it works:
>
>
> gcc -g -fpic -c pingpong.c -I/usr/local/ofed/include
>
> gcc -g -fpic -c -D_GNU_SOURCE srq_pingpong_lib_rowland.c -I/usr/local/ofed/include
>
> gcc -g -shared -Wl,-soname,libsrqtest.so -o libsrqtest.so srq_pingpong_lib_rowland.o pingpong.o -L/usr/local/ofed/lib64 -libverbs
>
> gcc -g -o srq_pingpong_rowland srq_pingpong_rowland.c -L$PWD -lsrqtest
>
>
> Above I am linking libibverbs directly into my libsrqtest.so
> library. This works and the IBVERBS 1.1 ABI is clearly in the
> libsrqtest.so file:
>
> [rowland at z1 ibverbs-examples]$ nm libsrqtest.so |grep ibv |head
> U ibv_ack_cq_events@@IBVERBS_1.1
> U ibv_alloc_pd@@IBVERBS_1.1
> U ibv_close_device@@IBVERBS_1.1
> U ibv_create_comp_channel@@IBVERBS_1.0
> U ibv_create_cq@@IBVERBS_1.1
> U ibv_create_qp@@IBVERBS_1.1
> U ibv_create_srq@@IBVERBS_1.1
> U ibv_dealloc_pd@@IBVERBS_1.1
> U ibv_dereg_mr@@IBVERBS_1.1
> U ibv_destroy_comp_channel@@IBVERBS_1.0
>
> However, if I build in a similar way to MVAPICH2, the resulting program
> fails:
>
>
> gcc -g -fpic -c pingpong.c -I/usr/local/ofed/include
>
> gcc -g -fpic -c -D_GNU_SOURCE srq_pingpong_lib_rowland.c -I/usr/local/ofed/include
>
> gcc -g -shared -Wl,-soname,libsrqtest.so -o libsrqtest.so srq_pingpong_lib_rowland.o pingpong.o
>
> gcc -g -o srq_pingpong_rowland srq_pingpong_rowland.c -L$PWD -L/usr/local/ofed/lib64 -lsrqtest -libverbs
>
>
> Above I am not linking libibverbs into libsrqtest.so, thus it is
> required on the last gcc line. This is how MVAPICH2's libmpich.so file
> works, and from past experience, I've seen this before. Running shows:
>
> [rowland at z1 ibverbs-examples]$ gdb ./srq_pingpong_rowland
> GNU gdb Red Hat Linux (6.3.0.0-1.132.EL4rh)
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain
> conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB. Type "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu"...Using host
> libthread_db library "/lib64/tls/libthread_db.so.1".
>
> (gdb) r
> Starting program:
> /home/7/rowland/z1-test/ibverbs-examples/srq_pingpong_rowland
> [Thread debugging using libthread_db enabled]
> [New Thread 182896403968 (LWP 29858)]
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 182896403968 (LWP 29858)]
> post_srq_recv_wrapper_1_0 (srq=0x5075b0, wr=0x7fbfff88d0,
> bad_wr=0x7fbfff88c8)
> at src/compat-1_0.c:312
> 312 src/compat-1_0.c: No such file or directory.
> in src/compat-1_0.c
> (gdb) bt
> #0 post_srq_recv_wrapper_1_0 (srq=0x5075b0, wr=0x7fbfff88d0,
> bad_wr=0x7fbfff88c8) at src/compat-1_0.c:312
> #1 0x0000002a95559e12 in ibv_post_srq_recv (srq=0x5075b0,
> recv_wr=0x7fbfff88d0, bad_recv_wr=0x7fbfff88c8)
> at /usr/local/ofed/include/infiniband/verbs.h:915
> #2 0x0000002a95559dcf in pp_post_recv (ctx=0x5023d0, n=500)
> at srq_pingpong_lib_rowland.c:496
> #3 0x0000002a9555a614 in lib_start (argc=1, argv=0x7fbffff7f8)
> at srq_pingpong_lib_rowland.c:696
> #4 0x0000000000400608 in main (argc=1, argv=0x7fbffff7f8)
> at srq_pingpong_rowland.c:36
> (gdb) quit
>
> It is not clear to me why linking libibverbs into libsrqtest.so, or not
> doing so, determines whether the IBVERBS 1.1 ABI is used. I looked at
> the libibverbs code, and the 1.1 ABI is the default. The libsrqtest.so
> file in the above case seems to have lost this information:
>
> [rowland at z1 ibverbs-examples]$ nm libsrqtest.so |grep ibv |head
> U ibv_ack_cq_events
> U ibv_alloc_pd
> U ibv_close_device
> U ibv_create_comp_channel
> U ibv_create_cq
> U ibv_create_qp
> U ibv_create_srq
> U ibv_dealloc_pd
> U ibv_dereg_mr
> U ibv_destroy_comp_channel
>
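The difference between the two nm listings quoted above comes down to
whether each undefined symbol carries an "@" or "@@" version tag. As a
minimal sketch, here is a small C helper that checks one nm output line
for such a tag. The name nm_symbol_version() is hypothetical and is not
part of libibverbs or MVAPICH2; it only looks at the text of the line.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not a libibverbs API): inspect one line of `nm`
 * output, e.g. "U ibv_create_cq@@IBVERBS_1.1", and report whether the
 * symbol carries a version tag. Returns 1 and copies the version name
 * into `out` when a tag is present; returns 0 and leaves `out` as an
 * empty string otherwise.
 */
int nm_symbol_version(const char *line, char *out, size_t outlen)
{
    const char *at;
    size_t n;

    assert(line != NULL && out != NULL && outlen > 0);
    out[0] = '\0';

    at = strchr(line, '@');
    if (at == NULL)
        return 0;               /* plain "U ibv_create_cq": no version tag */

    while (*at == '@')          /* skip the "@" or "@@" separator */
        at++;

    /* The version name runs up to the next whitespace or end of string. */
    n = strcspn(at, " \t\r\n");
    if (n >= outlen)
        n = outlen - 1;         /* truncate rather than overflow `out` */
    memcpy(out, at, n);
    out[n] = '\0';
    return 1;
}
```

For the lines shown in the two listings, this would return 1 (with
"IBVERBS_1.1" or "IBVERBS_1.0") for the build that links -libverbs into
libsrqtest.so, and 0 for the build that does not.
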
> I've never had to deal with an ABI issue like this in shared library
> linking/usage. Does it make sense for this to be the case? I think
> perhaps it does, but I wanted to ask.
>
> I've placed my test code here if it helps:
>
> http://www.cse.ohio-state.edu/~rowland/ibverbs-examples.tar.gz
>
> I have a fix for our code that I am testing now. It seems to work and
> solve the observed problems, but more testing will be required to be
> sure there are no issues. If the fix is needed, as seems to be the case
> at this point, it will require a new SRPM.