[openib-general] [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS
Scott Weitzenkamp (sweitzen)
sweitzen at cisco.com
Wed Jun 7 15:44:42 PDT 2006
I have not touched /etc/dat.conf, so I am using whatever comes with
OFED 1.0 rc5.

For whatever reason, things have improved somewhat. I am now running
Intel MPI right after bringing up the hosts (previously I was trying
MVAPICH, then Open MPI, then HP MPI, then Intel MPI). I've run twice,
and see these failures:
Run #1 (after rebooting all hosts):

rank 13 in job 1 192.168.1.1_34674 caused collective abort of all ranks
exit status of rank 13: killed by signal 11
[releng@svbu-qaclus-1 intel.intel]$
### TEST-W: Could not run /data/home/scott/builds/TopspinOS-2.7.0/build013/protest/Lk3/060706_123945/intel.intel/1149709233/IMB_2.3/src/IMB-MPI1 Allreduce : 0
Run #2 (after rebooting all hosts):

rank 6 in job 1 192.168.1.1_33649 caused collective abort of all ranks
exit status of rank 6: killed by signal 11
[releng@svbu-qaclus-1 intel.intel]$
### TEST-W: Could not run /data/home/scott/builds/TopspinOS-2.7.0/build013/protest/Lk3/060706_145739/intel.intel/1149717497/IMB_2.3/src/IMB-MPI1 Exchange : 0

rank 21 in job 1 192.168.1.1_34734 caused collective abort of all ranks
exit status of rank 21: killed by signal 11
[releng@svbu-qaclus-1 intel.intel]$
### TEST-W: Could not run /data/home/scott/builds/TopspinOS-2.7.0/build013/protest/Lk3/060706_145739/intel.intel/1149717497/IMB_2.3/src/IMB-MPI1 Allgatherrv -multi 1: 0
Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
> -----Original Message-----
> From: Arlin Davis [mailto:ardavis at ichips.intel.com]
> Sent: Wednesday, June 07, 2006 3:25 PM
> To: Scott Weitzenkamp (sweitzen)
> Cc: Davis, Arlin R; Lentini, James; openib-general
> Subject: Re: [openib-general] [PATCH] uDAPL openib-cma
> provider - add support for IB_CM_REQ_OPTIONS
>
> Scott Weitzenkamp (sweitzen) wrote:
>
> >Yes, the modules were loaded.
> >
> >Each of the 32 hosts had 3 IB ports up. Does Intel MPI or uDAPL use
> >multiple ports and/or multiple HCAs?
> >
> >I shut down all but one port on each host, and now Pallas is running
> >better on the 32 nodes using Intel MPI 2.0.1. HP MPI 2.2 started
> >working with Pallas over uDAPL too, so maybe this is a uDAPL issue?
> >
> >
> Can you tell me what adapters are installed (ibstat), how they are
> configured (ifconfig), and what your dat.conf looks like? It sounds
> like a device mapping issue during the dat_ia_open() processing.
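>
> To illustrate where that would show up, here is a minimal open/close
> sketch (my sketch, not code from any of the MPIs; assumes the standard
> <dat/udat.h> header, link with -ldat, error handling trimmed).
> dat_ia_open() is where the dat.conf lookup and device binding happen,
> so a mapping problem surfaces right there:
>
>     #include <stdio.h>
>     #include <dat/udat.h>
>
>     int main(void)
>     {
>         DAT_IA_HANDLE  ia  = DAT_HANDLE_NULL;
>         DAT_EVD_HANDLE evd = DAT_HANDLE_NULL;
>
>         /* "OpenIB-cma" must match the <ia_name> field of a dat.conf
>          * entry; the provider then resolves the <ia_params> (netdev
>          * name, hostname, or address) to an IB device and binds to
>          * it, so a bad or inconsistent entry fails right here. */
>         DAT_RETURN ret = dat_ia_open("OpenIB-cma", 8, &evd, &ia);
>         if (ret != DAT_SUCCESS) {
>             fprintf(stderr, "dat_ia_open failed: 0x%x\n",
>                     (unsigned)ret);
>             return 1;
>         }
>         printf("opened OpenIB-cma\n");
>         dat_ia_close(ia, DAT_CLOSE_GRACEFUL_FLAG);
>         return 0;
>     }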
>
> Multiple ports and HCAs should work fine, but some care is required in
> configuring dat.conf so you consistently pick up the correct device
> across the cluster. Intel MPI will simply open a device based on the
> provider/device name defined in dat.conf (example: setenv
> I_MPI_DAPL_PROVIDER OpenIB-cma) and query dapl for the address to be
> used for connections. This line in dat.conf determines which library
> to load and which IB device to open and bind to. If you have the exact
> same configuration on each node and know that ib0, ib1, ib2, etc. will
> always come up in the same order, then you can simply use the same
> netdev names across the cluster and the same exact copy of dat.conf on
> each node.
>
> Here are the dat.conf options for OpenIB-cma configurations.
>
> # For the cma version you specify <ia_params> as:
> #     network address, network hostname, or netdev name, and 0 for port
> #
> # Simple (OpenIB-cma) default with netdev name provided first on list
> # to enable use of the same dat.conf version on all nodes
> #
> OpenIB-cma u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" ""
> OpenIB-cma-ip u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "192.168.0.22 0" ""
> OpenIB-cma-name u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "svr1-ib0 0" ""
> OpenIB-cma-netdev u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" ""
>
> Which type are you using: address, hostname, or netdev name?
>
> Also, Intel MPI is sometimes too smart for its own good when opening
> rdma devices via uDAPL. If the open fails with the first rdma device
> specified in dat.conf, it will continue on to the next line until one
> is successful. If all rdma devices fail, it will then fall back to the
> static device automatically. This sometimes does more harm than good,
> since one node could be failing over to the second device in your
> configuration while the other nodes are all on the first device. If
> they are all on the same subnet it would work fine, but if they are on
> different subnets then we would not be able to connect.
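>
> One way to sanity-check this is to have each rank print the address
> its IA actually bound to after dat_ia_open(). A rough sketch (my
> assumptions: IPv4 addressing, and the DAT_IA_ALL query mask as used
> in the dapl dtest program):
>
>     #include <stdio.h>
>     #include <netinet/in.h>
>     #include <arpa/inet.h>
>     #include <dat/udat.h>
>
>     /* Print the local address the provider bound to, so mismatched
>      * devices/subnets across ranks stand out in the job output. */
>     static void print_bound_addr(DAT_IA_HANDLE ia)
>     {
>         DAT_IA_ATTR    attr;
>         DAT_EVD_HANDLE async_evd;
>
>         if (dat_ia_query(ia, &async_evd, DAT_IA_ALL, &attr,
>                          0, NULL) == DAT_SUCCESS) {
>             struct sockaddr_in *sin =
>                 (struct sockaddr_in *)attr.ia_address_ptr;
>             printf("bound to %s\n", inet_ntoa(sin->sin_addr));
>         }
>     }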
>
> If you send me your configuration, we can set it up here and hopefully
> duplicate your error case.
>
> -arlin
>