[openib-general] openMPI for 2.6.17.10 kernel

david elsen elsen_david at yahoo.com
Fri Dec 1 19:07:24 PST 2006


Shaun,
   
  It was working on one of my Fedora systems. I tried the same installation on my other system, which runs SuSE 9.3, and it is not working there.
   
  So I am not sure what is going on with this.
   
  Thanks,
  David
  

Shaun Rowland <rowland at cse.ohio-state.edu> wrote:
  Steve Wise wrote:
> I haven't tested mvapich2 with ammasso. But OSU has. I'm CCing their
> dev team so maybe they can help.
> 
> Steve.
> 
> 
> 
> On Fri, 2006-12-01 at 14:58 -0800, david elsen wrote:
>> Steve,
>>
>> I can run rping, rdma_lat etc on the Ammasso card but when I try to
>> run mvapich2 (0.9.8-Release), I get a "librdmacm.so missing" error.
>>
>> ./mpdboot -n 1
>> debug: starting
>> /root/0.9.8-RELEASE/bin/mpdroot: error while loading shared libraries:
>> librdmacm.so: cannot open shared object file: No such file or
>> directory
>> running mpdallexit on ammasso1
>> LAUNCHED mpd on ammasso1 via 
>> debug: launch cmd= /root/0.9.8-RELEASE/bin/mpd.py --ncpus=1 -e -d
>> debug: mpd on ammasso1 on port 35352
>> RUNNING: mpd on ammasso1
>> debug: info for running mpd: {'ncpus': 1, 'list_port': 35352,
>> 'entry_port': '', 'host': 'ammasso1', 'entry_host': '', 'ifhn': ''}

Hello David and Steve. We discussed this problem in detail on the
mvapich-discuss list recently. David, you indicated the following in
your last email about this to mvapich-discuss on 11/26/2006:

"For some reason, it is working in SuSE, and not working in Fedora."

Is this still the case? Were the libraries built specifically on the
Fedora Core 6 system, or are you using libraries that were built on
SuSE? I assume they were built on Fedora Core 6. Were you trying to run
this as root or as a regular user? I am not sure exactly how this might
affect shared library loading, but it is possible there is a difference.
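
On the root versus regular user question, one case where the two are known to differ is a set-user-ID or set-group-ID binary. When the real and effective IDs differ at startup, the glibc loader runs in secure-execution mode and ignores LD_LIBRARY_PATH, so a library that is only reachable through that variable is not found. Whether mpdroot is installed that way on the system in question is an assumption to check; nothing in this thread confirms it. A minimal sketch of the check, with a hypothetical helper name:

```shell
#!/bin/sh
# Succeeds if the file has the set-user-ID or set-group-ID bit set.
# is_setid is a hypothetical helper name, not part of MVAPICH2 or mpd.
is_setid() {
    [ -u "$1" ] || [ -g "$1" ]
}

# Example (path taken from the error message in this thread):
#   if is_setid /root/0.9.8-RELEASE/bin/mpdroot; then
#       echo "setuid/setgid: LD_LIBRARY_PATH may be ignored for other users"
#   fi
```

If the bit is set, the library directory would have to be known to the loader by other means (for example the system loader cache) rather than through LD_LIBRARY_PATH.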

In our previous discussion, your library path did indeed have a
librdmacm.so file, though it could not be loaded for an unknown reason.
It is unclear to me if this email thread indicates that you have tried
to rebuild that and are experiencing the same issue. Were you able to
try running that test shared library example I gave and did it work? Did
it work as the same user you are trying to run MVAPICH as?

It seems clear this is a runtime loader problem on Fedora Core 6, or on
your particular configuration there: it is the runtime loader that cannot
find the library. It is similar to the libtest code I provided as an example:

[rowland at e14-oib libtest]$ ls
Makefile test.c test.h test-program.c

[rowland at e14-oib libtest]$ make normal
gcc -c -fPIC test.c
gcc -shared -Wl,-soname,libtest.so.1 -o libtest.so.1.0 test.o
ln -s libtest.so.1.0 libtest.so.1
ln -s libtest.so.1 libtest.so
gcc -c -o test-program.o test-program.c
gcc -o test-program test-program.o -L/home/7/rowland/libtest -ltest

[rowland at e14-oib libtest]$ ldd test-program
libtest.so.1 => not found
libc.so.6 => /lib64/tls/libc.so.6 (0x0000003bf1900000)
/lib64/ld-linux-x86-64.so.2 (0x0000003bf1700000)

[rowland at e14-oib libtest]$ ./test-program
./test-program: error while loading shared libraries: libtest.so.1: cannot open shared object file: No such file or directory

[rowland at e14-oib libtest]$ export LD_LIBRARY_PATH=$PWD

[rowland at e14-oib libtest]$ ldd test-program
libtest.so.1 => /home/7/rowland/libtest/libtest.so.1 (0x00002abbf9aee000)
libc.so.6 => /lib64/tls/libc.so.6 (0x0000003bf1900000)
/lib64/ld-linux-x86-64.so.2 (0x0000003bf1700000)

[rowland at e14-oib libtest]$ ./test-program
In shared library function...
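
The libtest sources themselves are not included in this message, so the session above cannot be repeated from the thread alone. The following is a rough reconstruction as a shell function. The file names, the gcc commands, and the printed message come from the session above; the contents of the C files, the function name test_function, the helper name build_libtest, and the use of -L. in place of the absolute path are assumptions.

```shell
#!/bin/sh
# build_libtest DIR: write assumed libtest sources into DIR and build them
# with the same gcc commands shown in the session above.
build_libtest() (
    cd "$1" &&
    # Assumed file contents: only the printed message is known from the thread.
    printf '%s\n' 'void test_function(void);' > test.h &&
    printf '%s\n' \
        '#include <stdio.h>' \
        '#include "test.h"' \
        'void test_function(void) { printf("In shared library function...\n"); }' \
        > test.c &&
    printf '%s\n' \
        '#include "test.h"' \
        'int main(void) { test_function(); return 0; }' \
        > test-program.c &&
    gcc -c -fPIC test.c &&
    gcc -shared -Wl,-soname,libtest.so.1 -o libtest.so.1.0 test.o &&
    ln -s libtest.so.1.0 libtest.so.1 &&
    ln -s libtest.so.1 libtest.so &&
    gcc -c -o test-program.o test-program.c &&
    gcc -o test-program test-program.o -L. -ltest
)

# Example:
#   dir=$(mktemp -d) && build_libtest "$dir"
#   "$dir/test-program"                          # expected to fail to load libtest.so.1
#   LD_LIBRARY_PATH=$dir "$dir/test-program"     # loader can now find the library
```

Running it as the same user that runs MVAPICH2 shows whether LD_LIBRARY_PATH is honored for that user at all, independent of librdmacm.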

In a previous email, your ldd output showed the library was being found:

Please see the output of ldd /usr/local/mvapich2/bin/mpdroot :
[root at ammasso1 ~]# ldd /usr/local/mvapich2/bin/mpdroot
linux-gate.so.1 => (0xffffe000)
librdmacm.so => /usr/local/lib/librdmacm.so (0xb7fec000)
libibverbs.so.2 => /usr/local/lib/libibverbs.so.2 (0xb7fe5000)
libibumad.so.1 => /usr/local/lib/libibumad.so.1 (0xb7fdc000)
libpthread.so.0 => /lib/libpthread.so.0 (0x0012a000)
libc.so.6 => /lib/libc.so.6 (0x00ca7000)
libsysfs.so.2 => /usr/lib/libsysfs.so.2 (0x00369000)
libdl.so.2 => /lib/libdl.so.2 (0x00de6000)
libibcommon.so.1 => /usr/local/lib/libibcommon.so.1 (0xb7fcb000)
/lib/ld-linux.so.2 (0x002d8000)

But that path is different from the one you are quoting above. Does an
ldd on /root/0.9.8-RELEASE/bin/mpdroot find librdmacm.so too, as the
same user you are trying to run it as?
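
To make that ldd check quick to repeat for each binary and each user, the unresolved entries can be filtered out of the ldd output. This is only a sketch; filter_missing and missing_libs are hypothetical helper names, and the pattern relies on the "=> not found" text that ldd prints in the libtest session above.

```shell
#!/bin/sh
# Read ldd output on stdin and print the name of each unresolved library.
filter_missing() {
    awk '/=> not found/ { print $1 }'
}

# Print the unresolved libraries of one binary; prints nothing if all resolve.
missing_libs() {
    ldd "$1" 2>/dev/null | filter_missing
}

# Example (both mpdroot paths mentioned in this thread):
#   missing_libs /root/0.9.8-RELEASE/bin/mpdroot
#   missing_libs /usr/local/mvapich2/bin/mpdroot
```

An empty result for one path and "librdmacm.so" for the other would indicate that the two installations are linked or located differently.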

I have one more idea for you to try here. You can do the following:

export LD_DEBUG=all
/root/0.9.8-RELEASE/bin/mpdroot >&output
unset LD_DEBUG

Then take a look at the output file to see if there are any relevant
error messages. Don't forget to unset LD_DEBUG before doing anything else.
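
A variation that avoids having to remember the unset: set LD_DEBUG only for the one command. The sketch below assumes a POSIX shell and the glibc loader; it uses LD_DEBUG=libs, which limits the trace to library search paths and is usually shorter to read than "all". trace_loader is a hypothetical helper name.

```shell
#!/bin/sh
# trace_loader OUTFILE COMMAND [ARGS...]
# Run COMMAND with loader tracing enabled for that command only, and capture
# its stdout and stderr (the loader trace goes to stderr) in OUTFILE.
trace_loader() {
    out=$1
    shift
    LD_DEBUG=libs "$@" >"$out" 2>&1
}

# Example (path taken from the error message in this thread):
#   trace_loader output /root/0.9.8-RELEASE/bin/mpdroot
#   grep -n librdmacm output
```

Because the variable is set as a prefix to a single command, it does not remain in the shell environment afterwards.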

Also, just to be sure, if you run "file 
" what
does it say? It should indicate that it is a shared library, similar to:

[rowland at e14-oib libtest]$ file /usr/local/ofed/lib64/librdmacm.so*
/usr/local/ofed/lib64/librdmacm.so: symbolic link to `librdmacm.so.0.9.0'
/usr/local/ofed/lib64/librdmacm.so.0.9.0: ELF 64-bit LSB shared object, AMD x86-64, version 1 (SYSV), not stripped
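
Part of what the "file" output reports is the ELF class (32-bit or 64-bit). The earlier ldd output for mpdroot lists /lib/ld-linux.so.2, which suggests a 32-bit binary, and a library of the other class cannot be loaded by it. If "file" is not installed, the class can be read directly from the ELF header. This is a sketch assuming a POSIX od; elf_class is a hypothetical helper name.

```shell
#!/bin/sh
# Print "ELF 32-bit" or "ELF 64-bit" for an ELF file, based on the magic
# bytes and the EI_CLASS byte of the header. Symbolic links are followed.
elf_class() {
    magic=$(od -An -tx1 -N4 "$1" 2>/dev/null | tr -d ' \n')
    if [ "$magic" != "7f454c46" ]; then
        echo "not ELF"
        return 1
    fi
    # Byte at offset 4 is EI_CLASS: 1 means 32-bit, 2 means 64-bit.
    case $(od -An -tu1 -j4 -N1 "$1" | tr -d ' \n') in
        1) echo "ELF 32-bit" ;;
        2) echo "ELF 64-bit" ;;
        *) echo "ELF unknown class"; return 1 ;;
    esac
}

# Example (paths taken from this thread):
#   elf_class /root/0.9.8-RELEASE/bin/mpdroot
#   elf_class /usr/local/lib/librdmacm.so
```

The two results should match; a mismatch would mean the library was built for a different word size than the binary.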

Unfortunately, we do not have any Fedora Core 6 systems to investigate
this problem on at this time, and I don't know anything about what might
be there that would cause a problem. As far as I know, there shouldn't
be. However, it seems there is some runtime issue on your Fedora Core 6
machine or with how this is being run there. If it is in fact working on
another distribution as you indicated in your previous response to us,
then that also strongly points in this direction.
-- 
Shaun Rowland rowland at cse.ohio-state.edu
http://www.cse.ohio-state.edu/~rowland/


 