[openib-general] Problems running MPI jobs with MVAPICH and MVAPICH2

Don.Albert at Bull.com
Wed Mar 22 20:37:05 PST 2006


Weikuan,

Wow! Thanks for the analysis! I knew there were some differences in the
controller boards and PCI bus layout between the two machines, but I never
would have guessed that something as basic as the pointer sizes was set up
differently! I will have to dig into the history of these two
machines a bit. I know that "koa" at least had both Red Hat and SUSE
distributions installed at one time or another, but I am not sure about
"jatoba".

You are also correct that I could not get any version of MPI to run
between the two machines.

Thanks, again!

  -Don Albert-


Weikuan Yu <yuw at cse.ohio-state.edu> wrote on 03/22/2006 07:57:11 PM:

> Don,
> 
> Good to know the information on the other node.
> 
> > Both the kernel and the openib software was compiled separately on 
> > each machine.  The corresponding logs from 'jatoba' are attached 
> > below.  None of the directories are shared.  For compiling the "cpi.c" 
> > program, I compile it on each machine, but the directory structure is 
> > the same:  i.e. the "cpi" executable is under 
> > /home/ib/test/mpi/cpi/cpi on each machine.
> 
> This is where the problem came from!
> 
> The differences between these two nodes are causing the same mvapich 
> source code to be configured differently, which is enough to cause 
> incompatibilities at run time. The exact cause may be either the 
> Linux installations (i.e. 32-bit or 64-bit mode on EM64T) or 
> differences in the libraries you installed. You can take a diff 
> of the two config-mine.log files you have. Among the various 
> differences between them, one that is particularly important is the 
> different sizes of int, pointers and long, as shown by the following 
> portion.
> 
> ++++++++++++++++++++
> 137,141c142,145
> < checking for size of void *... unavailable
> < checking for pointers greater than 32 bits... no
> < checking for size of int... unavailable
> < checking for int large enough for pointers... yes
> < checking for size of void *... unavailable
> ---
>  > checking for size of void *... 8
>  > checking for pointers greater than 32 bits... yes
>  > checking for size of int... 4
>  > checking for int large enough for pointers... no
> ++++++++++++++++++++++
> 
> With this taken into consideration, I am just curious: have you been 
> able to run any MPI implementation across these two nodes? Or mvapich 
> with mpirun_rsh instead of mpirun_mpd? It wouldn't be surprising if the 
> answer is no.
> 
> The size differences above lead to differences in many of the 
> structures. That is why you are not able to run either mvapich-gen2 or 
> mvapich2-gen2. In a slightly larger context, these two nodes can be 
> taken as a sample case of heterogeneous configurations. We have plans 
> to work out solutions for this kind of heterogeneity in 
> mvapich/mvapich2. It may take some more time to get ready.
> 
> So that leaves the question of how to get these two nodes to 
> run mvapich. I would suggest you first unify the system installation 
> on these two nodes, and then compile the OpenIB/gen2 kernel/userspace on 
> one node and distribute it to the other(s). The same goes for 
> building/installing/running mvapich/mvapich2.
> 
> Please keep us updated about how this gets solved in the end.
> 
> Thanks,
> Weikuan
> --
> Weikuan Yu, Computer Science, OSU
> http://www.cse.ohio-state.edu/~yuw
> 