[openib-general] Problems running MPI jobs with MVAPICH

Weikuan Yu yuw at cse.ohio-state.edu
Thu Apr 6 17:53:06 PDT 2006


Hi, Don,

Good to know that you are able to run mvapich with mpirun_rsh. We can 
now focus on the MPD problem. We have never attempted to run the 
MPD_RING option as root. Just curious, were you able to run 
mvapich2-gen2 with MPD_RING? The two share more or less the same code. 
Could you try the following three things and send us all the log 
files?

a) rpm -e lam.
The reason for this is that I noticed LAM showing up earlier in your 
config.log. Removing the other MPI packages from your path may help 
configure find the right one.
b) Try mvapich-gen2 with mpd_ring, either as root or as a regular 
user. Please configure/build/install on one node and propagate the 
installation, to see if it runs; we can look into a separate build 
later on. BTW, make sure you run `make install' at the end of the 
configure/build.
c) If possible, could you also try mvapich2-gen2 with mpd_ring, since 
the mpd_ring-related code is similar there? That may help locate the 
problem. A rough command sequence for all three steps is sketched 
below.
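
For reference, the sequence I have in mind is roughly as follows. The 
paths and host names are only examples (adjust the install prefix and 
the host list to your setup), and the exact MPD launcher names may 
differ slightly in your tree:

    # a) remove competing MPI packages
    rpm -qa | grep -i lam          # check whether LAM is installed
    rpm -e lam                     # remove it (as root)

    # b) configure/build/install on one node ...
    cd mvapich-gen2
    ./make.mvapich.gen2            # edit HAVE_MPD_RING first, then build
    make install                   # do not skip this step
    # ... and propagate the installation to the second node
    scp -r /usr/local/mvapich node2:/usr/local/

    # b)/c) bring up an MPD ring on both nodes and run a test program
    mpdboot -n 2 -f mpd.hosts      # mpd.hosts lists the two nodes
    mpirun -np 2 ./cpi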

Thanks,
Weikuan


On Apr 6, 2006, at 8:02 PM, Don.Albert at Bull.com wrote:

>
> Weikuan
>
> I previously reported that I was having problems running any MPI jobs 
> between a pair of EM64T machines running RHEL4 Update 3 with the 
> OpenIB modules (kernel version 2.6.9-34.ELsmp) and the "mvapich-gen2" 
> code from the OpenIB svn tree. I was having two problems:
>
> 	1. When I tried to run from user mode, I would get segmentation 
> 	faults.
>
> 	2. When I ran as root, the jobs would fail with the following 
> 	message: "cpi: pmgr_client_mpd.c:254: mpd_exchange_info: Assertion 
> 	`len_remote == len_local' failed.".
>
> The first problem turned out to be a memory problem: I had to 
> increase the max locked-in-memory address space (memlock) in the 
> user limits.
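>
> For reference, raising the limit in /etc/security/limits.conf looks 
> roughly like this (the 4194304 KB value is just an example; 
> "unlimited" also works):
>
>     *    soft    memlock    4194304
>     *    hard    memlock    4194304
>
> A fresh login shell then shows the new limit via `ulimit -l`.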
>
> The second problem seemed to be related more to process management 
> than to MPI itself. I remembered that when I modified the 
> "make.mvapich.gen2" build script, there was a parameter for MPD:
>
>   # Whether to use an optimized queue pair exchange scheme. This is 
>   # not checked for a setting in the script. It must be set here 
>   # explicitly.
>   # Supported: "-DUSE_MPD_RING", "-DUSE_MPD_BASIC" and "" (to disable)
>   HAVE_MPD_RING=""
>
> Because I wanted to use MPD to launch jobs, I set 
> HAVE_MPD_RING="-DUSE_MPD_RING" in the build script.
>
> I went back, set the parameter to HAVE_MPD_RING="" to disable it, 
> and rebuilt, which meant that MPD was not installed. Using 
> "mpirun_rsh" I am now able to run the MPI jobs, including "cpi", 
> "mping" and other benchmark tests.
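>
> For the record, the invocation I am using is along these lines, with 
> placeholder hostnames (the process count comes first, followed by one 
> hostname per process and then the executable):
>
>     mpirun_rsh -np 2 hostA hostB ./cpi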
>
> There seems to be a problem with "USE_MPD_RING". Have you seen this 
> before? Should I try "USE_MPD_BASIC" instead?
>
>         -Don Albert-
--
Weikuan Yu, Computer Science, OSU
http://www.cse.ohio-state.edu/~yuw



