[openib-general] Problems running MPI jobs with MVAPICH and MVAPICH2
Weikuan Yu
yuw at cse.ohio-state.edu
Wed Mar 22 18:57:11 PST 2006
Don,
Good to know the information on the other node.
> Both the kernel and the openib software was compiled separately on
> each machine. The corresponding logs from 'jatoba' are attached
> below. None of the directories are shared. For compiling the "cpi.c"
> program, I compile it on each machine, but the directory structure is
> the same: i.e. the "cpi" executable is under
> /home/ib/test/mpi/cpi/cpi on each machine.
This is where the problem came from!
The differences between these two nodes are causing the same mvapich
source code to be configured differently, which is enough to cause the
incompatibilities at run time. The exact problem may lie either in the
Linux installations (i.e., 32-bit vs. 64-bit mode on EM64T) or in
different libraries being installed on the two nodes. You can take a
diff of the two config-mine.log files you have. Among the various
differences between them, one particularly important one is the
different sizes of int, pointer, and long, as shown by the following
portion.
++++++++++++++++++++
137,141c142,145
< checking for size of void *... unavailable
< checking for pointers greater than 32 bits... no
< checking for size of int... unavailable
< checking for int large enough for pointers... yes
< checking for size of void *... unavailable
---
> checking for size of void *... 8
> checking for pointers greater than 32 bits... yes
> checking for size of int... 4
> checking for int large enough for pointers... no
++++++++++++++++++++++
Taking this into consideration, and just being curious: have you been
able to run any MPI implementation across these two nodes? Or mvapich
with mpirun_rsh instead of mpirun_mpd? It wouldn't be surprising if the
answer is no.
The size differences above lead to differences in many of the
structures. That is why you are not able to run either mvapich-gen2 or
mvapich2-gen2. In a slightly larger context, these two nodes can be
taken as a sample case of heterogeneous configurations. We have plans
to work out solutions for this kind of heterogeneity in
mvapich/mvapich2, but it may take some more time to get ready.
So that leaves the question of how to get these two nodes to run
mvapich. I would suggest you first unify the system installation
on these two nodes. And then compile OpenIB/gen2 kernel/userspace on
one node and distribute to the other(s). Same thing for
building/installing/running mvapich/mvapich2.
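As a rough sketch of the distribution step (the install prefix and build directory below are placeholders for whatever you actually use; only the jatoba hostname and the cpi path come from your setup), one could do something like:

```shell
# Build and install mvapich once on the first node.
# /home/ib/mvapich and /usr/local/mvapich are example paths only.
cd /home/ib/mvapich
./configure --prefix=/usr/local/mvapich && make && make install

# Copy the identical installation to the other node, so both sides
# run binaries produced by the very same configure results.
rsync -a /usr/local/mvapich/ jatoba:/usr/local/mvapich/

# Likewise build the test program once and copy it, instead of
# compiling it separately on each machine.
rsync -a /home/ib/test/mpi/cpi/ jatoba:/home/ib/test/mpi/cpi/
```

The point is that every binary that the two nodes share should come from a single build, never from two independently configured source trees.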
Please keep us updated on how this gets solved in the end.
Thanks,
Weikuan
--
Weikuan Yu, Computer Science, OSU
http://www.cse.ohio-state.edu/~yuw