[openib-general] [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS

Scott Weitzenkamp (sweitzen) sweitzen at cisco.com
Wed Jun 7 15:44:42 PDT 2006


I have not touched /etc/dat.conf, so I am using whatever comes with OFED
1.0 rc5.

For whatever reason, things have improved somewhat.  I am now running
Intel MPI right after bringing up the hosts (previously I was trying
MVAPICH, then Open MPI, then HP MPI, then Intel MPI).  I've run twice and
seen these failures:

Run #1 (after rebooting all hosts):

rank 13 in job 1  192.168.1.1_34674   caused collective abort of all ranks
  exit status of rank 13: killed by signal 11
[releng at svbu-qaclus-1 intel.intel]$
### TEST-W: Could not run /data/home/scott/builds/TopspinOS-2.7.0/build013/protest/Lk3/060706_123945/intel.intel/1149709233/IMB_2.3/src/IMB-MPI1 Allreduce : 0

Run #2 (after rebooting all hosts):

rank 6 in job 1  192.168.1.1_33649   caused collective abort of all ranks
  exit status of rank 6: killed by signal 11
[releng at svbu-qaclus-1 intel.intel]$
### TEST-W: Could not run /data/home/scott/builds/TopspinOS-2.7.0/build013/protest/Lk3/060706_145739/intel.intel/1149717497/IMB_2.3/src/IMB-MPI1 Exchange : 0

rank 21 in job 1  192.168.1.1_34734   caused collective abort of all ranks
  exit status of rank 21: killed by signal 11
[releng at svbu-qaclus-1 intel.intel]$
### TEST-W: Could not run /data/home/scott/builds/TopspinOS-2.7.0/build013/protest/Lk3/060706_145739/intel.intel/1149717497/IMB_2.3/src/IMB-MPI1 Allgatherv -multi 1: 0

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
 

> -----Original Message-----
> From: Arlin Davis [mailto:ardavis at ichips.intel.com] 
> Sent: Wednesday, June 07, 2006 3:25 PM
> To: Scott Weitzenkamp (sweitzen)
> Cc: Davis, Arlin R; Lentini, James; openib-general
> Subject: Re: [openib-general] [PATCH] uDAPL openib-cma 
> provider - add support for IB_CM_REQ_OPTIONS
> 
> Scott Weitzenkamp (sweitzen) wrote:
> 
> >Yes, the modules were loaded.
> >
> >Each of the 32 hosts had 3 IB ports up.  Does Intel MPI or uDAPL use
> >multiple ports and/or multiple HCAs?
> >
> >I shut down all but one port on each host, and now Pallas is running
> >better on the 32 nodes using Intel MPI 2.0.1.  HP MPI 2.2 also started
> >working with Pallas over uDAPL, so maybe this is a uDAPL issue?
> >  
> >
> Can you tell me what adapters are installed (ibstat), how they are 
> configured (ifconfig),  and what your dat.conf looks like? It sounds 
> like a device mapping issue during the dat_ia_open() processing.
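
For reference, here is a minimal sketch (mine, not from the thread) of what
that dat_ia_open() mapping step looks like, assuming the DAT 1.2 uDAPL API;
the "OpenIB-cma" string is just a provider name taken from a dat.conf entry,
and the async EVD queue length of 8 is arbitrary:

/* open_ia.c: build with something like gcc open_ia.c -ldat */
#include <stdio.h>
#include <dat/udat.h>

int main(void)
{
    DAT_IA_HANDLE  ia = DAT_HANDLE_NULL;
    DAT_EVD_HANDLE async_evd = DAT_HANDLE_NULL; /* provider-created async EVD */

    /* "OpenIB-cma" must match the first field of a dat.conf line; the
     * registry then loads that library and binds to the device named in
     * the <ia_params> field of that line. */
    DAT_RETURN ret = dat_ia_open("OpenIB-cma", 8, &async_evd, &ia);
    if (ret != DAT_SUCCESS) {
        fprintf(stderr, "dat_ia_open(OpenIB-cma) failed: 0x%x\n", ret);
        return 1;
    }
    printf("OpenIB-cma opened\n");
    dat_ia_close(ia, DAT_CLOSE_GRACEFUL_FLAG);
    return 0;
}

If the provider name cannot be mapped to a usable device on a node, this
open is where the failure first shows up.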
> 
> Multiple ports and HCAs should work fine, but some care is required in
> configuring dat.conf so you consistently pick up the correct device
> across the cluster. Intel MPI will simply open a device based on the
> provider/device name defined in dat.conf (example: setenv
> I_MPI_DAPL_PROVIDER=OpenIB-cma) and query dapl for the address to be
> used for connections. This line in dat.conf determines which library to
> load and which IB device to open and bind to. If you have the exact same
> configuration on each node and know that ib0, ib1, ib2, etc. will always
> come up in the same order, then you can simply use the same netdev names
> across the cluster and the same exact copy of dat.conf on each node.
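
One quick way to sanity-check that every node's dat.conf registers the same
provider names is something like the sketch below; it only assumes the
dat_registry_list_providers() call from the DAT 1.2 registry API:

/* list_providers.c: print the provider names the DAT registry picked up
 * from dat.conf on this node (sketch, DAT 1.2 API assumed). */
#include <stdio.h>
#include <dat/udat.h>

#define MAX_PROVIDERS 16

int main(void)
{
    DAT_PROVIDER_INFO  info[MAX_PROVIDERS];
    DAT_PROVIDER_INFO *list[MAX_PROVIDERS];
    DAT_COUNT          n = 0;
    int                i;

    for (i = 0; i < MAX_PROVIDERS; i++)
        list[i] = &info[i];

    if (dat_registry_list_providers(MAX_PROVIDERS, &n, list) != DAT_SUCCESS) {
        fprintf(stderr, "dat_registry_list_providers failed\n");
        return 1;
    }
    for (i = 0; i < n; i++)
        printf("provider: %s\n", list[i]->ia_name);
    return 0;
}

Comparing this output across nodes is a quick way to confirm that the same
dat.conf (and therefore the same device mapping) is in place everywhere.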
> 
> Here are the dat.conf options for OpenIB-cma configurations.
> 
> # For cma version you specify <ia_params> as:
> #   network address, network hostname, or netdev name and 0 for port
> #
> # Simple (OpenIB-cma) default with netdev name provided first on list
> # to enable use of same dat.conf version on all nodes
> #
> OpenIB-cma u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" ""
> OpenIB-cma-ip u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "192.168.0.22 0" ""
> OpenIB-cma-name u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "svr1-ib0 0" ""
> OpenIB-cma-netdev u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" ""
> 
> Which type are you using: address, hostname, or netdev name?
> 
> Also, Intel MPI is sometimes too smart for its own good when opening
> rdma devices via uDAPL. If the open fails with the first rdma device
> specified in dat.conf, it will continue on to the next line until one is
> successful. If all rdma devices fail, it will then fall back to the
> static device automatically. This sometimes does more harm than good,
> since one node could be failing over to the second device in your
> configuration while the other nodes are all on the first device. If they
> are all on the same subnet then it would work fine, but if they are on
> different subnets then we would not be able to connect.
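
A rough illustration of that try-the-next-entry fallback (a sketch against
the DAT 1.2 API, not Intel MPI's actual code; the two provider names are
taken from the dat.conf sample above):

/* fallback_open.c: try dat.conf provider names in order and keep the
 * first one that opens -- a sketch of the fallback behavior described
 * above, not Intel MPI's implementation. */
#include <stdio.h>
#include <dat/udat.h>

static char *candidates[] = { "OpenIB-cma", "OpenIB-cma-netdev", NULL };

int main(void)
{
    DAT_IA_HANDLE  ia = DAT_HANDLE_NULL;
    DAT_EVD_HANDLE async_evd = DAT_HANDLE_NULL;
    int i;

    for (i = 0; candidates[i] != NULL; i++) {
        if (dat_ia_open(candidates[i], 8, &async_evd, &ia) == DAT_SUCCESS) {
            printf("using provider %s\n", candidates[i]);
            /* Hazard described above: if this node falls through to a
             * different entry than its peers, and that device sits on a
             * different subnet, the later connection setup fails. */
            dat_ia_close(ia, DAT_CLOSE_GRACEFUL_FLAG);
            return 0;
        }
        async_evd = DAT_HANDLE_NULL;  /* reset before trying the next entry */
    }
    fprintf(stderr, "no uDAPL provider could be opened\n");
    return 1;
}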
> 
> If you send me your configuration, we can set it up here and hopefully
> duplicate your error case.
> 
> -arlin
> 



