[openib-general] Problems running MPI jobs with MVAPICH

Don.Albert at Bull.com Don.Albert at Bull.com
Tue Mar 28 22:53:26 PST 2006


Weikuan,

This is a follow-up to my email from last week about problems running the
simple "cpi" job between two EM64T machines using the OpenIB and MVAPICH
stacks from the OpenIB tree.

The suggestion that configuration differences between the two systems were
causing the problem certainly sounded plausible. Since we could not
determine the exact provenance of the software and configurations on the
two systems, we decided to wipe them and start over. We installed the
latest Red Hat Enterprise Linux 4 Update 3 distribution, with the
2.6.9-34.ELsmp kernel, on both machines from scratch. Then I pulled the
latest (svn 6035) version of the userspace and mpi sources.

Since the RHEL4 Update 3 release seems to include all the OpenIB kernel
modules, and the HCA ports came up 'ACTIVE', I decided not to install the
latest 2.6.16 kernel initially, but just to build the userspace libraries
and the mvapich-gen2 code.

Instead of building on both systems separately, I built all the code on
one system and copied the libraries and executables to the other system. I
can run the "ibv_rc_pingpong" and "ibv_ud_pingpong" tests between the two
systems, so I think all the software is functioning.
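
Since the binaries were copied rather than rebuilt, one quick sanity check is to confirm that both hosts really hold identical files. The sketch below is only an illustration: the helper name "checksum" is hypothetical, and it assumes "md5sum" from coreutils is installed on both machines.

```shell
#!/bin/sh
# checksum FILE
# Hypothetical helper: print only the MD5 digest of FILE.
# Assumes md5sum (coreutils) is available.
checksum() {
    # Refuse unreadable files so an empty digest is never printed.
    [ -r "$1" ] || { echo "cannot read $1" >&2; return 2; }
    # Reading from stdin keeps the file name out of md5sum's output.
    md5sum < "$1" | cut -d' ' -f1
}
```

Running "checksum /home/ib/mpi/tests/cpi/cpi" on koa and again on jatoba and comparing the two digests would show whether the executables match. The same check could be repeated for each copied library.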

The bottom line is that the problem is exactly the same as before. I can
run MPD and spawn jobs on the local system, or force a job to execute on
the other system, and "mpdtrace" shows the following:

[koa] (ib) ib> mpdtrace
mpdtrace: koa_32841:  lhs=jatoba_32833  rhs=jatoba_32833  rhs2=koa_32841  gen=1
mpdtrace: jatoba_32833:  lhs=koa_32841  rhs=koa_32841  rhs2=jatoba_32833  gen=1

but when I try to run jobs that execute on both systems, I get the
following on the initiating system:

[koa] (ib) ib> mpirun_mpd -np 2 /home/ib/mpi/tests/cpi/cpi
cpi: pmgr_client_mpd.c:254: mpd_exchange_info: Assertion `len_remote == len_local' failed.
[man_0]: application program exited abnormally with status 0
[man_0]: application program signaled with signal 6 (: Aborted)
cpi: pmgr_client_mpd.c:254: mpd_exchange_info: Assertion `len_remote == len_local' failed.
[koa] (ib) ib>

and I see the following on the remote system:

[jatoba] (ib) ib> [man_1]: application program exited abnormally with status 0
[man_1]: application program signaled with signal 6 (: Aborted)

Are there any logs or traces I can collect or turn on to help isolate this
problem? Should I just comment out the assertion in the code and see
how far I get? Attached are the configuration and build logs.
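
Independent of whatever tracing MVAPICH or MPD may offer (not covered here), a generic way to capture each failing run for comparison is a small wrapper that saves the output and records how the command ended. This is a sketch only: the name "run_logged" is hypothetical, and it relies on the usual shell convention that a status above 128 means termination by signal (status - 128), so signal 6 appears as 134.

```shell
#!/bin/sh
# run_logged LOGFILE COMMAND [ARGS...]
# Hypothetical helper: run COMMAND, save stdout and stderr to LOGFILE,
# and append a line describing how the command ended.
run_logged() {
    log=$1
    shift
    status=0
    "$@" > "$log" 2>&1 || status=$?
    if [ "$status" -gt 128 ]; then
        # Shell convention: 128 + N means the command died from signal N.
        echo "terminated by signal $((status - 128))" >> "$log"
    else
        echo "exit status $status" >> "$log"
    fi
    return "$status"
}
```

For example, "run_logged cpi.log mpirun_mpd -np 2 /home/ib/mpi/tests/cpi/cpi" would keep the assertion message in cpi.log. Note that the recorded status is that of mpirun_mpd itself, which may not reflect the signal that killed the application processes.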

        -Don Albert-



-------------- next part --------------
A non-text attachment was scrubbed...
Name: config.log
Type: application/octet-stream
Size: 6970 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20060328/ac0480e7/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: config.status
Type: application/octet-stream
Size: 20899 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20060328/ac0480e7/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: config-mine.log
Type: application/octet-stream
Size: 17513 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20060328/ac0480e7/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: configure
Type: application/octet-stream
Size: 512039 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20060328/ac0480e7/attachment-0003.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: install-mine.log
Type: application/octet-stream
Size: 1852 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20060328/ac0480e7/attachment-0004.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: make-mine.log
Type: application/octet-stream
Size: 290613 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20060328/ac0480e7/attachment-0005.obj>

