[openib-general] mvapich2 ofed 1.2 problem
Shaun Rowland
rowland at cse.ohio-state.edu
Wed Feb 14 20:31:32 PST 2007
Roland Dreier wrote:
> > When I build using the OFED-1.2-20070208-1508, libibverbs 1.0 is what is
> > built, at least by looking at the .so file result:
> >
> > [rowland at z0 ~]$ ls /usr/local/ofed/lib64/ |grep ibverbs
> > libibverbs.a
> > libibverbs.so
> > libibverbs.so.1
> > libibverbs.so.1.0.0
>
> The soname hasn't changed because the library is still compatible.
> But (I hope at least) OFED has libibverbs 1.1.
The soname is libibverbs.so.1, so I guess the longer name would not
matter anyway. Clearly, what I posted shows the IBVERBS 1.1 ABI is
there. I think I have figured out why our code has this problem. The
problem below is similar to the original one posted about.
I did some experimentation with the srq_pingpong libibverbs example
code. First I built it directly with:
gcc -g -c pingpong.c -I/usr/local/ofed/include
gcc -g -c -D_GNU_SOURCE srq_pingpong.c -I/usr/local/ofed/include
gcc -g -o srq_pingpong srq_pingpong.o pingpong.o -L/usr/local/ofed/lib64 \
    -libverbs
This works. Next I copied srq_pingpong.c to two files:
srq_pingpong_rowland.c
- just has a main function that calls lib_start().
srq_pingpong_lib_rowland.c
- main() changed to lib_start().
This moves all of the SRQ pingpong code into a shared library. If I
build this shared library in this way, it works:
gcc -g -fpic -c pingpong.c -I/usr/local/ofed/include
gcc -g -fpic -c -D_GNU_SOURCE srq_pingpong_lib_rowland.c \
    -I/usr/local/ofed/include
gcc -g -shared -Wl,-soname,libsrqtest.so -o libsrqtest.so \
    srq_pingpong_lib_rowland.o pingpong.o -L/usr/local/ofed/lib64 -libverbs
gcc -g -o srq_pingpong_rowland srq_pingpong_rowland.c -L$PWD -lsrqtest
Above I am linking libibverbs directly into my libsrqtest.so
library. This works and the IBVERBS 1.1 ABI is clearly in the
libsrqtest.so file:
[rowland at z1 ibverbs-examples]$ nm libsrqtest.so |grep ibv |head
U ibv_ack_cq_events@@IBVERBS_1.1
U ibv_alloc_pd@@IBVERBS_1.1
U ibv_close_device@@IBVERBS_1.1
U ibv_create_comp_channel@@IBVERBS_1.0
U ibv_create_cq@@IBVERBS_1.1
U ibv_create_qp@@IBVERBS_1.1
U ibv_create_srq@@IBVERBS_1.1
U ibv_dealloc_pd@@IBVERBS_1.1
U ibv_dereg_mr@@IBVERBS_1.1
U ibv_destroy_comp_channel@@IBVERBS_1.0
However, if I build in a similar way to MVAPICH2, the resulting program
fails:
gcc -g -fpic -c pingpong.c -I/usr/local/ofed/include
gcc -g -fpic -c -D_GNU_SOURCE srq_pingpong_lib_rowland.c \
    -I/usr/local/ofed/include
gcc -g -shared -Wl,-soname,libsrqtest.so -o libsrqtest.so \
    srq_pingpong_lib_rowland.o pingpong.o
gcc -g -o srq_pingpong_rowland srq_pingpong_rowland.c -L$PWD \
    -L/usr/local/ofed/lib64 -lsrqtest -libverbs
Above I am not linking libibverbs into libsrqtest.so, so it is
required on the last gcc line. This is how MVAPICH2's libmpich.so file
is built, and I have seen this arrangement before. Running the result
under gdb shows:
[rowland at z1 ibverbs-examples]$ gdb ./srq_pingpong_rowland
GNU gdb Red Hat Linux (6.3.0.0-1.132.EL4rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...Using host
libthread_db library "/lib64/tls/libthread_db.so.1".
(gdb) r
Starting program: /home/7/rowland/z1-test/ibverbs-examples/srq_pingpong_rowland
[Thread debugging using libthread_db enabled]
[New Thread 182896403968 (LWP 29858)]
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 182896403968 (LWP 29858)]
post_srq_recv_wrapper_1_0 (srq=0x5075b0, wr=0x7fbfff88d0,
bad_wr=0x7fbfff88c8)
at src/compat-1_0.c:312
312 src/compat-1_0.c: No such file or directory.
in src/compat-1_0.c
(gdb) bt
#0 post_srq_recv_wrapper_1_0 (srq=0x5075b0, wr=0x7fbfff88d0,
bad_wr=0x7fbfff88c8) at src/compat-1_0.c:312
#1 0x0000002a95559e12 in ibv_post_srq_recv (srq=0x5075b0,
recv_wr=0x7fbfff88d0, bad_recv_wr=0x7fbfff88c8)
at /usr/local/ofed/include/infiniband/verbs.h:915
#2 0x0000002a95559dcf in pp_post_recv (ctx=0x5023d0, n=500)
at srq_pingpong_lib_rowland.c:496
#3 0x0000002a9555a614 in lib_start (argc=1, argv=0x7fbffff7f8)
at srq_pingpong_lib_rowland.c:696
#4 0x0000000000400608 in main (argc=1, argv=0x7fbffff7f8)
at srq_pingpong_rowland.c:36
(gdb) quit
It is not clear to me why linking libibverbs into libsrqtest.so, or not
doing so, determines whether the IBVERBS 1.1 ABI is used. I looked at
the libibverbs code, and the 1.1 ABI is the default.
The libsrqtest.so file in the above case seems to have lost this
information:
[rowland at z1 ibverbs-examples]$ nm libsrqtest.so |grep ibv |head
U ibv_ack_cq_events
U ibv_alloc_pd
U ibv_close_device
U ibv_create_comp_channel
U ibv_create_cq
U ibv_create_qp
U ibv_create_srq
U ibv_dealloc_pd
U ibv_dereg_mr
U ibv_destroy_comp_channel
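As a rough aid for comparing the two nm listings above, a small C helper
could pull the version tag out of each line. This is only a sketch:
nm_version_tag is a hypothetical name, not part of libibverbs or binutils,
and it assumes the "symbol@@VERSION" (or "symbol@VERSION") layout that nm
prints in the output above.

```c
#include <string.h>

/*
 * Hypothetical helper (not part of libibverbs or binutils).
 *
 * Copies the version tag of one line of `nm` output into `out`.
 * Returns 1 if the line carries a tag such as "IBVERBS_1.1",
 * and 0 if the symbol is unversioned or the tag does not fit in `out`.
 */
int nm_version_tag(const char *line, char *out, size_t outlen)
{
    const char *at = strchr(line, '@');
    if (at == NULL)
        return 0;               /* e.g. "U ibv_create_cq" */

    while (*at == '@')          /* accept both "@" and "@@" */
        at++;

    /* The tag runs up to the next whitespace or end of line. */
    size_t len = strcspn(at, " \t\r\n");
    if (len == 0 || len >= outlen)
        return 0;

    memcpy(out, at, len);
    out[len] = '\0';
    return 1;
}
```

Run over each line of "nm libsrqtest.so |grep ibv", the helper would
report a tag for every symbol in the first build and none in the second,
which is the difference visible in the two listings.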
I've never had to deal with an ABI issue like this in shared library
linking/usage. Does it make sense for this to be the case? I think
perhaps it does, but I wanted to ask.
I've placed my test code here if it helps:
http://www.cse.ohio-state.edu/~rowland/ibverbs-examples.tar.gz
I have a fix for our code that I am testing now. It seems to work and
solve the observed problems, but more testing will be required to be
sure there are no issues. If the fix is needed, which it seems to be at
this point, it will require a new SRPM.
--
Shaun Rowland rowland at cse.ohio-state.edu
http://www.cse.ohio-state.edu/~rowland/