[openib-general] Problems running MPI jobs with MVAPICH and MVAPICH2
Don.Albert at Bull.com
Wed Mar 22 07:50:20 PST 2006
I have been struggling for several days trying to get either
"mvapich-gen2" or "mvapich2-gen2" to work on a pair of EM64T machines. A
week or so ago I downloaded the latest Linux kernel (2.6.15.6 at the time)
and the then-current gen2 tree (revision 5685), and built the kernel with
the appropriate links to the gen2 version. That part seems to be fine: I
can load all the kernel modules and run various diagnostics such as
ibstat, ibnetdiscover and perfquery, so I have two systems that appear to
be successfully connected through a central switch.
I then followed the procedures in the "mvapich_user_guide.pdf" for
mvapich, and in the "mvapich2_user_guide.pdf" for mvapich2, to build the
MPI portions of the gen2 tree. In both cases I built the software using
the appropriate "make.mvapich.gen2" or "make.mvapich2.gen2" script, with
the parameters "_PCI_EX_", "_SDR_" and "-DUSE_MPD_RING", plus
"_SMALL_CLUSTER" for mvapich2. The builds and installs seemed to go fine.
In both cases I can set up a ring with MPD and run jobs; if I run a
simple command like "env" or "uname", I get responses from both systems.
However, when I try to run any job that actually makes MPI calls (such as
the simple "cpi.c" from the examples directory), I run into problems. I
can run a single copy on either the local system or the remote system,
but when I try to run two or more processes, they fail. The failures
differ between mvapich and mvapich2, but both occur when the two
processes try to communicate over MPI.
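The ring itself comes up cleanly. The sequence is roughly the following
(shown for the mvapich2 install; the hostfile name and location are just
examples, and the mpd* utilities are the ones shipped with the MPD
process manager):

    # Bring up a two-node MPD ring and check it with a non-MPI command.
    echo jatoba > ~/mpd.hosts        # remote host; the local mpd starts on koa
    mpdboot -n 2 -f ~/mpd.hosts      # one mpd locally, one on jatoba
    mpdtrace                         # should list both koa and jatoba
    mpiexec -n 2 uname               # non-MPI command; output from both hosts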
For the sample "cpi.c" job, in each case I set $PATH to point at the
appropriate installation and compiled the source with "mpicc cpi.c -o
cpi". Then I set up the MPD ring and attempted to run the job. In the
examples below I ran the job first on the local system, then on the
remote system, and then tried to run two processes.
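The per-run environment setup is along these lines (the install prefix
shown is just an example; it is whatever prefix the build script
installed to):

    # mvapich2 case shown; the mvapich case differs only in the prefix.
    export PATH=/usr/local/mvapich2/bin:$PATH   # example install prefix
    which mpicc                                 # confirm the right wrapper is first
    mpicc cpi.c -o cpi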
In the mvapich case, the failure is:
[root at koa cpi]# mpirun_mpd -np 1 ./cpi
Process 0 on koa.az05.bull.com
pi is approximately 3.1416009869231254, Error is 0.0000083333333323
wall clock time = 0.000083
[root at koa cpi]# mpirun_mpd -1 -np 1 ./cpi
Process 0 on jatoba.az05.bull.com
pi is approximately 3.1416009869231254, Error is 0.0000083333333323
wall clock time = 0.000103
[root at koa cpi]# mpirun_mpd -np 2 ./cpi
cpi: pmgr_client_mpd.c:254: mpd_exchange_info: Assertion `len_remote == len_local' failed.
[man_0]: application program exited abnormally with status 0
[man_0]: application program signaled with signal 6 (: Aborted)
cpi: pmgr_client_mpd.c:254: mpd_exchange_info: Assertion `len_remote == len_local' failed.
In the mvapich2 case, the failure is:
[root at koa cpi2]# mpiexec -n 1 ./cpi
Process 0 of 1 is on koa.az05.bull.com
pi is approximately 3.1415926544231341, Error is 0.0000000008333410
wall clock time = 0.000702
[root at koa cpi2]# mpiexec -1 -n 1 ./cpi
Process 0 of 1 is on jatoba.az05.bull.com
pi is approximately 3.1415926544231341, Error is 0.0000000008333410
wall clock time = 0.000622
[root at koa cpi2]# mpiexec -n 2 ./cpi
[rdma_iba_priv.c:564]: PMI_KVS_Get error
aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(188): Initialization failed
MPID_Init(118): channel initialization failed
MPIDI_CH3_Init(87): Channel init failed
MPIDI_CH3I_RMDA_init(481): PMI_KVS_Get returned -1
rank 1 in job 5 koa.az05.bull.com_60194 caused collective abort of all ranks
exit status of rank 1: killed by signal 9
rank 0 in job 5 koa.az05.bull.com_60194 caused collective abort of all ranks
exit status of rank 0: killed by signal 9
Does anyone have any ideas about this?
-Don Albert-
Bull HN Information Systems