[openib-general] Problems running MPI jobs with MVAPICH

Weikuan Yu yuw at cse.ohio-state.edu
Wed Mar 29 03:34:52 PST 2006


Don,

Thanks for reporting the problem.

> problem?   Should I just comment out the "Assertion" in the code and see
> how far I get?   Attached are the configuration and build logs.
>

Could you try running MVAPICH (1 or 2) with the TCP option first, or use
mpirun_rsh initially, just to make sure you can run any MPI program at all?
Please let us know of any problems there as well, along with the same set
of traces and your machine specifications.
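
For example, something along these lines as a first sanity check. The
hostnames and the cpi path below are taken from your mpdtrace output and
your mpirun_mpd command, so adjust them to your setup; treat this as a
sketch, since the exact device/TCP selection depends on how your tree was
configured:

    # launch cpi with mpirun_rsh directly, bypassing the MPD ring
    mpirun_rsh -np 2 koa jatoba /home/ib/mpi/tests/cpi/cpi

If that runs cleanly, trying the same command line against a build
configured for the TCP/sockets device would help tell us whether the
problem is specific to the gen2 path.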

Weikuan



On Tue, 28 Mar 2006 Don.Albert at Bull.com wrote:

> Weikuan,
>
> This is a followup to my email from last week about problems running the
> simple "cpi" job between two EM64T machines using the OpenIB and MVAPICH
> stacks from the OpenIB tree.
>
> The suggestion that configuration differences on the two systems were
> causing the problem certainly sounded plausible.   Since we could not
> determine the exact provenance of the software and configurations on the
> two systems,  we decided to wipe them and start over.   So we installed
> the latest RedHat Enterprise Linux 4, Update 3 distribution, with the
> 2.6.9-34.ELsmp kernel on both machines from scratch.  Then I pulled the
> latest (svn 6035) version of the userspace and mpi sources.
>
> Since the RHEL4, Update 3, release seems to have all the kernel OpenIB
> modules, and the HCA ports came up 'ACTIVE',  I decided not to install the
> latest 2.6.16 kernel initially, but just to build the userspace libraries
> and mvapich-gen2 code.
>
> Instead of building on both systems separately,  I built all the code on
> one system and copied the libraries and executables to the other system. I
> can run the "ibv_rc_pingpong" and "ibv_ud_pingpong" tests between the two
> systems, so I think all the software is functioning.
>
> The bottom line is that the problem is exactly the same as before:   I can
> run MPD and spawn jobs on the local system,  or force a job to execute on
> the other system,  and "mpdtrace" shows the following:
>
> [koa] (ib) ib> mpdtrace
> mpdtrace: koa_32841:  lhs=jatoba_32833  rhs=jatoba_32833  rhs2=koa_32841
> gen=1
> mpdtrace: jatoba_32833:  lhs=koa_32841  rhs=koa_32841  rhs2=jatoba_32833
> gen=1
>
> but when I try to run jobs that execute on both systems,  I get the
> following on the initiating system:
>
> [koa] (ib) ib> mpirun_mpd -np 2 /home/ib/mpi/tests/cpi/cpi
> cpi: pmgr_client_mpd.c:254: mpd_exchange_info: Assertion `len_remote ==
> len_local' failed.
> [man_0]: application program exited abnormally with status 0
> [man_0]: application program signaled with signal 6 (: Aborted)
> cpi: pmgr_client_mpd.c:254: mpd_exchange_info: Assertion `len_remote ==
> len_local' failed.
> [koa] (ib) ib>
>
> and I see the following on the remote system:
>
> [jatoba] (ib) ib> [man_1]: application program exited abnormally with
> status 0
> [man_1]: application program signaled with signal 6 (: Aborted)
>
> Are there any logs or traces I can collect or turn on to help isolate this
> problem?   Should I just comment out the "Assertion" in the code and see
> how far I get?   Attached are the configuration and build logs.
>
>         -Don Albert-
>