[openib-general] Problems running MPI jobs with MVAPICH
Don.Albert at Bull.com
Tue Mar 28 22:53:26 PST 2006
Weikuan,
This is a follow-up to my email from last week about problems running the
simple "cpi" job between two EM64T machines using the OpenIB and MVAPICH
stacks from the OpenIB tree.
The suggestion that configuration differences on the two systems were
causing the problem certainly sounded plausible. Since we could not
determine the exact provenance of the software and configurations on the
two systems, we decided to wipe them and start over. So we did a fresh
install of Red Hat Enterprise Linux 4, Update 3, with the 2.6.9-34.ELsmp
kernel, on both machines. Then I pulled the latest version (svn 6035) of
the userspace and MPI sources.
Since the RHEL4 Update 3 release seems to include all the OpenIB kernel
modules, and the HCA ports came up 'ACTIVE', I decided not to install the
latest 2.6.16 kernel initially, but just to build the userspace libraries
and the mvapich-gen2 code.
Instead of building on both systems separately, I built all the code on
one system and copied the libraries and executables to the other system. I
can run the "ibv_rc_pingpong" and "ibv_ud_pingpong" tests between the two
systems, so I think all the software is functioning.
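Because the binaries were built once and then copied, one quick sanity check is to confirm that the files on both machines really are byte-identical. The following is only a minimal sketch, not something from the OpenIB tree: the function names are hypothetical and the choice of files to check is left to the reader. It computes a SHA-256 manifest for a set of files; running it on each host and comparing the two manifests would show any file that differs.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths):
    """Map each existing file path (as a string) to its digest.

    Paths that do not name a regular file are skipped.
    """
    return {str(p): sha256_of(p) for p in map(Path, paths) if p.is_file()}


def diff_manifests(local, remote):
    """Return sorted paths whose digests differ or that exist on one side only."""
    return sorted(
        path
        for path in set(local) | set(remote)
        if local.get(path) != remote.get(path)
    )
```

For example, `build_manifest(["/home/ib/mpi/tests/cpi/cpi"])` could be run on koa and on jatoba and the two results passed to `diff_manifests`; an empty list would mean the listed files match.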
The bottom line is that the problem is exactly the same as before: I can
run MPD and spawn jobs on the local system, or force a job to execute on
the other system, and "mpdtrace" shows the following:
[koa] (ib) ib> mpdtrace
mpdtrace: koa_32841: lhs=jatoba_32833 rhs=jatoba_32833 rhs2=koa_32841 gen=1
mpdtrace: jatoba_32833: lhs=koa_32841 rhs=koa_32841 rhs2=jatoba_32833 gen=1
but when I try to run jobs that execute on both systems, I get the
following on the initiating system:
[koa] (ib) ib> mpirun_mpd -np 2 /home/ib/mpi/tests/cpi/cpi
cpi: pmgr_client_mpd.c:254: mpd_exchange_info: Assertion `len_remote == len_local' failed.
[man_0]: application program exited abnormally with status 0
[man_0]: application program signaled with signal 6 (: Aborted)
cpi: pmgr_client_mpd.c:254: mpd_exchange_info: Assertion `len_remote == len_local' failed.
[koa] (ib) ib>
and I see the following on the remote system:
[jatoba] (ib) ib> [man_1]: application program exited abnormally with status 0
[man_1]: application program signaled with signal 6 (: Aborted)
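A note on reading the output above: a failed C assert() calls abort(), and abort() raises SIGABRT, which is signal 6 on Linux. So the "signal 6 (: Aborted)" lines are most likely the direct consequence of the failed assertion rather than a separate fault. The sketch below is an illustrative helper only; it is not part of MVAPICH or MPD, and the function name and regular expression are assumptions based solely on the lines shown here. It pulls the rank and signal number out of the [man_N] lines and maps the number to its symbolic name.

```python
import re
import signal

# Matches lines such as:
#   [man_0]: application program signaled with signal 6 (: Aborted)
SIGNAL_LINE = re.compile(
    r"\[man_(\d+)\]: application program signaled with signal (\d+)"
)


def signaled_ranks(log_text):
    """Return {rank: signal name} for every 'signaled with signal N' line."""
    result = {}
    for match in SIGNAL_LINE.finditer(log_text):
        rank, signum = int(match.group(1)), int(match.group(2))
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The number is not a known signal on this platform.
            name = f"signal {signum}"
        result[rank] = name
    return result
```

Feeding it the combined output from both machines would give one entry per rank, which makes it easy to see whether every process died the same way.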
Are there any logs or traces I can collect or turn on to help isolate this
problem? Should I just comment out the "Assertion" in the code and see
how far I get? Attached are the configuration and build logs.
-Don Albert-
-------------- next part --------------
A non-text attachment was scrubbed...
Name: config.log
Type: application/octet-stream
Size: 6970 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20060328/ac0480e7/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: config.status
Type: application/octet-stream
Size: 20899 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20060328/ac0480e7/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: config-mine.log
Type: application/octet-stream
Size: 17513 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20060328/ac0480e7/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: configure
Type: application/octet-stream
Size: 512039 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20060328/ac0480e7/attachment-0003.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: install-mine.log
Type: application/octet-stream
Size: 1852 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20060328/ac0480e7/attachment-0004.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: make-mine.log
Type: application/octet-stream
Size: 290613 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20060328/ac0480e7/attachment-0005.obj>