[openib-general] Problems running MPI jobs with MVAPICH

Weikuan Yu yuw at cse.ohio-state.edu
Thu Apr 6 18:34:16 PDT 2006


A quick follow-up. I have just built, configured, and propagated the 
mvapich-gen2 installation on two EM64T nodes as root. mvapich-gen2 runs 
fine with the MPD_RING option. Here are the commands I used; I hope 
they help.

1) Prepare your MPD password/conf files, /root/.mpdpasswd and 
/root/.mpd.conf. They should have identical contents and mode 600.
[root at e14-oib mvapich-gen2]# cat /root/.mpd.conf
password=56rtG9
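
For reference, setting the two files up from scratch looks roughly like 
this (the password string is just a placeholder; use your own):

  echo "password=<your_secret>" > /root/.mpd.conf
  cp /root/.mpd.conf /root/.mpdpasswd
  chmod 600 /root/.mpd.conf /root/.mpdpasswd
  scp -p /root/.mpd.conf /root/.mpdpasswd e15:/root/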

2) Run make.mvapich.gen2, selecting /root/installs as $PREFIX and 
adding -DUSE_MPD_RING to the options.
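
The relevant settings inside make.mvapich.gen2 look roughly like this 
(variable names as in the script; double-check your copy), and remember 
to finish with `make install' so everything lands under $PREFIX:

  PREFIX=/root/installs
  HAVE_MPD_RING="-DUSE_MPD_RING"
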
[root at e14-oib mvapich-gen2]# scp -r /root/installs e15:/root/.

3) [root at e14-oib mvapich-gen2]# /root/installs/bin/mpicc -o /root/cpi 
examples/basic/cpi.c
[root at e14-oib mvapich-gen2]# scp /root/cpi e15:/root/.
cpi                                           100%  294KB 293.8KB/s   00:00

4) Some system info
[root at e14-oib mvapich-gen2]# cat /etc/redhat-release
Red Hat Enterprise Linux AS release 4 (Nahant Update 2)
[root at e14-oib mvapich-gen2]# uname -a
Linux e14-oib 2.6.15 #3 SMP Mon Mar 6 20:48:17 PST 2006 x86_64 x86_64 
x86_64 GNU/Linux
[root at e14-oib mvapich-gen2]# LD_LIBRARY_PATH=/usr/local/lib 
/root/installs/bin/mpdtrace
mpdtrace: e14-oib_43520:  lhs=e15-oib_60830  rhs=e15-oib_60830  
rhs2=e14-oib_43520 gen=1
mpdtrace: e15-oib_60830:  lhs=e14-oib_43520  rhs=e14-oib_43520  
rhs2=e15-oib_60830 gen=1
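
Note that the ring itself has to be brought up before mpdtrace will show 
anything. Assuming the MPICH-1-style mpd daemon that this build installs, 
that step is along these lines (the exact binary location and flags may 
differ in your tree; the port is whatever the first mpd reports):

  # on e14-oib: start the first mpd
  /root/installs/bin/mpd &
  # on e15-oib: join the ring at the host/port the first mpd listens on
  /root/installs/bin/mpd -h e14-oib -p <port> &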

5) Running two processes on one node or across two nodes.
[root at e14-oib mvapich-gen2]# LD_LIBRARY_PATH=/usr/local/lib 
/root/installs/bin/mpirun_mpd -np 2 /root/cpi -MPDENV- 
LD_LIBRARY_PATH=/usr/local/lib
Process 0 of 2 on e14-oib
Process 1 of 2 on e15-oib
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.000424
[root at e14-oib mvapich-gen2]# LD_LIBRARY_PATH=/usr/local/lib 
/root/installs/bin/mpirun_mpd -g 2 -np 2 /root/cpi -MPDENV- 
LD_LIBRARY_PATH=/usr/local/lib
Process 0 of 2 on e14-oib
Process 1 of 2 on e14-oib
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.000406
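
As a usage note, anything after -MPDENV- appears to be handed to the 
launched processes' environment (which is why LD_LIBRARY_PATH shows up 
twice above), and -g appears to control how many consecutive ranks are 
grouped on one node, hence both ranks landing on e14-oib in the second 
run. Scaling the same job up is just a matter of raising -np, e.g. (a 
sketch):

  LD_LIBRARY_PATH=/usr/local/lib /root/installs/bin/mpirun_mpd -np 4 \
      /root/cpi -MPDENV- LD_LIBRARY_PATH=/usr/local/lib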

Let us know how we can help further,

Weikuan


On Apr 6, 2006, at 8:53 PM, Weikuan Yu wrote:

> Hi, Don,
>
> Good to know that you are able to run mvapich with mpirun_rsh. We can 
> now focus on the MPD problem. We have never attempted to run the 
> MPD_RING option as root. Just curious, were you able to run 
> mvapich2-gen2 with MPD_RING? They share more or less the same code. So 
> could you try the following possibilities and send us all the log 
> files, etc.?
>
> a) rpm -e lam.
> The reason for this is that I noticed LAM showing up in your 
> config.log earlier. It might help the configure step if you remove 
> the other MPI packages that are on your path.
> b) Try mvapich-gen2 with mpd_ring, either as root or as a regular 
> user. Please do the build/configure/install on one node and propagate 
> the installation, to see if it runs. We can look into the separate 
> build later on. BTW, make sure you do `make install' at the end of 
> configure/build.
> c) If possible, could you also try mvapich2-gen2 with mpd_ring, since 
> the mpd_ring-related code is similar there? That may help to locate 
> the problem.
>
> Thanks,
> Weikuan
>
>
> On Apr 6, 2006, at 8:02 PM, Don.Albert at Bull.com wrote:
>
>>
>> Weikuan
>>
>> I previously reported that I was having problems running any MPI jobs 
>> between a pair of EM64T machines running RHEL4 Update 3 with the 
>> OpenIB modules (kernel version 2.6.9-34.ELsmp) and the "mvapich-gen2" 
>> code from the OpenIB svn tree. I was having two problems:
>>
>>   1. When I tried to run from user mode, I would get segmentation 
>>      faults.
>>
>>   2. When I ran from root, the jobs would fail with the following 
>>      message: "cpi: pmgr_client_mpd.c:254: mpd_exchange_info: 
>>      Assertion `len_remote == len_local' failed."
>>
>> The first problem turned out to be a memory problem;  I had to 
>> increase the size of the max locked-in-memory address space (memlock) 
>> in the user limits.
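>>
>> For anyone else who hits the same segfaults: one common way to raise 
>> that limit is an entry along these lines in /etc/security/limits.conf 
>> (the values here are just an example):
>>
>>     *    soft    memlock    unlimited
>>     *    hard    memlock    unlimited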
>>
>> The second problem seemed to be more related to process management 
>> than to MPI itself.   I remembered that when I modified the 
>> "make.mvapich.gen2" build script,  there was a parameter for MPD:
>>
>>   # Whether to use an optimized queue pair exchange scheme.  This is not
>>   # checked for a setting in the script.  It must be set here explicitly.
>>   # Supported: "-DUSE_MPD_RING", "-DUSE_MPD_BASIC" and "" (to disable)
>>   HAVE_MPD_RING=""
>>
>> Because I wanted to use MPD to launch jobs,  I set   
>> HAVE_MPD_RING="-DUSE_MPD_RING"  in the build script.
>>
>> I went back and set the parameter to HAVE_MPD_RING="" to disable it, 
>> and rebuilt, which meant that MPD was not installed.   Using 
>> "mpirun_rsh" I am now able to run the MPI jobs,  including "cpi", 
>> "mping" and other benchmark tests.
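>>
>> For reference, mpirun_rsh takes the host list directly on the command 
>> line, so the runs look roughly like this (hostnames are placeholders; 
>> adjust the paths for your install):
>>
>>     mpirun_rsh -np 2 <host1> <host2> /root/cpi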
>>
>> There seems to be a problem with "USE_MPD_RING".    Have you seen 
>> this before?   Should I try with "USE_MPD_BASIC" instead?
>>
>>         -Don Albert-
>> _______________________________________________
>> openib-general mailing list
>> openib-general at openib.org
>> http://openib.org/mailman/listinfo/openib-general
> --
> Weikuan Yu, Computer Science, OSU
> http://www.cse.ohio-state.edu/~yuw
>
>
--
Weikuan Yu, Computer Science, OSU
http://www.cse.ohio-state.edu/~yuw



