[openib-general] Problems running MPI jobs with MVAPICH and MVAPICH2
Don.Albert at Bull.com
Wed Mar 22 07:50:20 PST 2006
I have been struggling for several days trying to get either
"mvapich-gen2" or "mvapich2-gen2" to work on a pair of EM64T machines. A
week or so ago I downloaded the latest Linux kernel (2.6.15.6 at the time)
and the then-current gen2 tree (revision 5685), and built the kernel with
the appropriate links to the gen2 version. That part seems to be fine: I
can load all the kernel modules and run various diagnostics such as
ibstat, ibnetdiscover and perfquery, so I have two systems that appear to
be successfully connected through a central switch.
I then followed the procedures in the "mvapich_user_guide.pdf" for
mvapich, and in the "mvapich2_user_guide.pdf" for mvapich2, to build the
MPI portions of the gen2 tree. In both cases I built the software using
the appropriate "make.mvapich.gen2" or "make.mvapich2.gen2" script, with
the parameters "_PCI_EX_", "_SDR_" and "-DUSE_MPD_RING", plus
"_SMALL_CLUSTER" for mvapich2. The builds and installs seemed to go fine.
In both cases I can set up a ring with MPD and run jobs; if I run a
simple command like "env" or "uname", I get responses from both systems.
However, when I try to run any job that actually makes MPI calls (such as
the simple "cpi.c" from the examples directory), I run into problems. I
can run a single copy on either the local system or the remote system,
but when I try to run two or more processes, they fail. The failures
differ between mvapich and mvapich2, but both occur when the two
processes try to communicate over MPI.
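The ring itself comes up cleanly. The sequence is roughly the following
(shown for the mvapich2 install; the hostfile name and location are just
examples, and the mpd* utilities are the ones shipped with the MPD
process manager):

    # Bring up a two-node MPD ring and check it with a non-MPI command.
    echo jatoba > ~/mpd.hosts        # remote host; the local mpd starts on koa
    mpdboot -n 2 -f ~/mpd.hosts      # one mpd locally, one on jatoba
    mpdtrace                         # should list both koa and jatoba
    mpiexec -n 2 uname               # non-MPI command; output from both hosts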
For the sample "cpi.c" job, in each case I set $PATH to point at the
appropriate installation and compiled the source with "mpicc cpi.c -o
cpi". Then I set up the MPD ring and attempted to run the job. In the
examples below I ran the job first on the local system, then on the
remote system, and then tried to run two processes.
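The per-run environment setup is along these lines (the install prefix
shown is just an example; it is whatever prefix the build script
installed to):

    # mvapich2 case shown; the mvapich case differs only in the prefix.
    export PATH=/usr/local/mvapich2/bin:$PATH   # example install prefix
    which mpicc                                 # confirm the right wrapper is first
    mpicc cpi.c -o cpi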
In the mvapich case, the failure is:
[root at koa cpi]# mpirun_mpd -np 1 ./cpi
Process 0 on koa.az05.bull.com
pi is approximately 3.1416009869231254, Error is 0.0000083333333323
wall clock time = 0.000083
[root at koa cpi]# mpirun_mpd -1 -np 1 ./cpi
Process 0 on jatoba.az05.bull.com
pi is approximately 3.1416009869231254, Error is 0.0000083333333323
wall clock time = 0.000103
[root at koa cpi]# mpirun_mpd -np 2 ./cpi
cpi: pmgr_client_mpd.c:254: mpd_exchange_info: Assertion `len_remote == len_local' failed.
[man_0]: application program exited abnormally with status 0
[man_0]: application program signaled with signal 6 (: Aborted)
cpi: pmgr_client_mpd.c:254: mpd_exchange_info: Assertion `len_remote == len_local' failed.
In the mvapich2 case, the failure is:
[root at koa cpi2]# mpiexec -n 1 ./cpi
Process 0 of 1 is on koa.az05.bull.com
pi is approximately 3.1415926544231341, Error is 0.0000000008333410
wall clock time = 0.000702
[root at koa cpi2]# mpiexec -1 -n 1 ./cpi
Process 0 of 1 is on jatoba.az05.bull.com
pi is approximately 3.1415926544231341, Error is 0.0000000008333410
wall clock time = 0.000622
[root at koa cpi2]# mpiexec -n 2 ./cpi
[rdma_iba_priv.c:564]: PMI_KVS_Get error
aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(188): Initialization failed
MPID_Init(118): channel initialization failed
MPIDI_CH3_Init(87): Channel init failed
MPIDI_CH3I_RMDA_init(481): PMI_KVS_Get returned -1
rank 1 in job 5 koa.az05.bull.com_60194 caused collective abort of all ranks
exit status of rank 1: killed by signal 9
rank 0 in job 5 koa.az05.bull.com_60194 caused collective abort of all ranks
exit status of rank 0: killed by signal 9
Does anyone have any ideas about this?
-Don Albert-
Bull HN Information Systems