[openib-general] Problems running MPI jobs with MVAPICH and MVAPICH2
Don.Albert at Bull.com
Wed Mar 22 20:37:05 PST 2006
Weikuan,
Wow! Thanks for the analysis! I knew there were some differences in the
controller boards and PCI bus layout between the two machines, but I never
would have guessed that the basic geometry like pointer sizes was set up
differently! I will have to dig into the past history of these two
machines a bit. I know that "koa" at least had both RedHat and Suse
distributions installed at one time or another, but I am not sure about
"jatoba".
You are also correct that I could not get any version of MPI to run
between the two machines.
Thanks, again!
-Don Albert-
Weikuan Yu <yuw at cse.ohio-state.edu> wrote on 03/22/2006 07:57:11 PM:
> Don,
>
> Good to know the information on the other node.
>
> > Both the kernel and the openib software were compiled separately on
> > each machine. The corresponding logs from 'jatoba' are attached
> > below. None of the directories are shared. For compiling the "cpi.c"
> > program, I compile it on each machine, but the directory structure is
> > the same: i.e. the "cpi" executable is under
> > /home/ib/test/mpi/cpi/cpi on each machine.
>
> This is where the problem came from!
>
> The differences between these two nodes cause the same mvapich
> source code to be configured differently, which is enough to cause
> incompatibilities at run time. The root cause could be either the
> Linux installations (i.e. 32-bit versus 64-bit mode on EM64T) or
> differences in the libraries you installed. You can take a diff of
> the two config-mine.log files you have. Among the various
> differences between them, one that is particularly important is the
> different sizes of int, pointers and long, as shown by the following
> portion.
>
> ++++++++++++++++++++
> 137,141c142,145
> < checking for size of void *... unavailable
> < checking for pointers greater than 32 bits... no
> < checking for size of int... unavailable
> < checking for int large enough for pointers... yes
> < checking for size of void *... unavailable
> ---
> > checking for size of void *... 8
> > checking for pointers greater than 32 bits... yes
> > checking for size of int... 4
> > checking for int large enough for pointers... no
> ++++++++++++++++++++++
>
> Taking this into consideration, and just out of curiosity: have you
> been able to run any MPI implementation across these two nodes? Or
> mvapich with mpirun_rsh instead of mpirun_mpd? It wouldn't be
> surprising if the answer is no.
>
> The size differences above lead to differences in many of the
> structures. That is why you are not able to run either mvapich-gen2 or
> mvapich2-gen2. In a somewhat larger context, these two nodes can be
> taken as a sample case of a heterogeneous configuration. We have plans
> to work out solutions for this kind of heterogeneity in
> mvapich/mvapich2. It may take some more time before they are ready.
>
> So that leaves the question of how to get these two nodes to run
> mvapich. I would suggest you first unify the system installation
> on these two nodes, then compile the OpenIB/gen2 kernel/userspace on
> one node and distribute it to the other(s). Do the same for
> building/installing/running mvapich/mvapich2.
>
> Please keep us updated about how this gets solved in the end.
>
> Thanks,
> Weikuan
> --
> Weikuan Yu, Computer Science, OSU
> http://www.cse.ohio-state.edu/~yuw
>