[openib-general] Problems running MPI jobs with MVAPICH and MVAPICH2

Weikuan Yu yuw at cse.ohio-state.edu
Wed Mar 22 18:57:11 PST 2006


Don,

Good to know this information about the other node.

> Both the kernel and the openib software was compiled separately on 
> each machine.  The corresponding logs from 'jatoba' are attached 
> below.  None of the directories are shared.  For compiling the "cpi.c" 
> program, I compile it on each machine, but the directory structure is 
> the same:  i.e. the "cpi" executable is under 
> /home/ib/test/mpi/cpi/cpi on each machine.

This is where the problem came from!

The differences between these two nodes cause the same mvapich 
source code to be configured differently, which is enough to produce 
incompatibilities at run time. The exact problem may come either from 
the Linux installations (i.e., 32-bit mode vs. 64-bit mode on EM64T) 
or from differences in the libraries you installed. You can take a 
diff of the two config-mine.log files. Among the various differences 
between them, one particularly important one is the different sizes 
of int, pointers, and long, as shown in the following portion.

++++++++++++++++++++
137,141c142,145
< checking for size of void *... unavailable
< checking for pointers greater than 32 bits... no
< checking for size of int... unavailable
< checking for int large enough for pointers... yes
< checking for size of void *... unavailable
---
> checking for size of void *... 8
> checking for pointers greater than 32 bits... yes
> checking for size of int... 4
> checking for int large enough for pointers... no
++++++++++++++++++++++

So take this into consideration. Just curious: have you been able 
to run any other MPI implementation across these two nodes? Or mvapich 
with mpirun_rsh instead of mpirun_mpd? It wouldn't be surprising if 
the answer is no.

The size differences above lead to differences in many of the 
structures. That is why you are not able to run either mvapich-gen2 or 
mvapich2-gen2. In a slightly larger context, these two nodes can be 
taken as a sample case of heterogeneous configurations. We have plans 
to work out solutions for this kind of heterogeneity in 
mvapich/mvapich2, but it may take some more time to get ready.

So that leaves the question of how to get these two nodes to run 
mvapich. I would suggest you first unify the system installation on 
the two nodes, then compile the OpenIB/gen2 kernel and userspace on 
one node and distribute the build to the other(s). Do the same when 
building, installing, and running mvapich/mvapich2.

Please keep us updated on how this gets solved in the end.

Thanks,
Weikuan
--
Weikuan Yu, Computer Science, OSU
http://www.cse.ohio-state.edu/~yuw



