[openib-general] Problems running MPI jobs with MVAPICH and MVAPICH2

Matthew Koop koop at cse.ohio-state.edu
Wed Mar 22 11:11:17 PST 2006


Don,

Sorry to hear you are having problems; however, we have been unable to
replicate the behavior you are seeing.

Since the MPD daemons appear to have started properly, as shown by your
ability to run non-MPI commands, I wonder whether your $PATH was set
correctly when you compiled the cpi application.

Can you verify this by echoing $PATH before compiling? Can you also send
us more details about your environment?

Also, for MVAPICH2, can you run the 'mpdtrace' command to verify that MPD
has started properly on the requested nodes?
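For reference, the two checks above might look like this from a shell on the head node; the exact bin directory shown in the comment is only an example, not a known path on your system:

```shell
# Confirm that the compiler wrappers resolve to the intended MVAPICH install.
# mpicc should point into the MVAPICH/MVAPICH2 bin directory you built against.
echo $PATH
which mpicc

# For MVAPICH2: list the hosts currently in the MPD ring.
# Every node you pass to mpiexec should appear here; -l also shows the port.
mpdtrace -l
```

If `which mpicc` resolves to a different MPI installation (e.g. a system MPICH), recompiling cpi with the correct wrapper is the first thing to try.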

Matthew Koop
-
OSU Network-Based Computing Lab



> [root at koa cpi]# mpirun_mpd -np 2 ./cpi
> cpi: pmgr_client_mpd.c:254: mpd_exchange_info: Assertion `len_remote ==
> len_local' failed.
> [man_0]: application program exited abnormally with status 0
> [man_0]: application program signaled with signal 6 (: Aborted)
> cpi: pmgr_client_mpd.c:254: mpd_exchange_info: Assertion `len_remote ==
> len_local' failed.
>
>
> In the mvapich2 case, the failure is:
>
> [root at koa cpi2]# mpiexec -n 1 ./cpi
> Process 0 of 1 is on koa.az05.bull.com
> pi is approximately 3.1415926544231341, Error is 0.0000000008333410
> wall clock time = 0.000702
> [root at koa cpi2]# mpiexec -1 -n 1 ./cpi
> Process 0 of 1 is on jatoba.az05.bull.com
> pi is approximately 3.1415926544231341, Error is 0.0000000008333410
> wall clock time = 0.000622
> [root at koa cpi2]# mpiexec -n 2 ./cpi
> [rdma_iba_priv.c:564]: PMI_KVS_Get error
>
> aborting job:
> Fatal error in MPI_Init: Other MPI error, error stack:
> MPIR_Init_thread(188): Initialization failed
> MPID_Init(118): channel initialization failed
> MPIDI_CH3_Init(87): Channel init failed
> MPIDI_CH3I_RMDA_init(481): PMI_KVS_Get returned -1
> rank 1 in job 5  koa.az05.bull.com_60194   caused collective abort of all
> ranks
>   exit status of rank 1: killed by signal 9
> rank 0 in job 5  koa.az05.bull.com_60194   caused collective abort of all
> ranks
>   exit status of rank 0: killed by signal 9
>
>
> Does anyone have any ideas about this?
>
> -Don Albert-
> Bull HN Information Systems
>





