[openib-general] [PATCH] uDAPL openib-cma provider - add support for IB_CM_REQ_OPTIONS

Arlin Davis ardavis at ichips.intel.com
Wed Jun 7 15:24:46 PDT 2006


Scott Weitzenkamp (sweitzen) wrote:

>Yes, the modules were loaded.
>
>Each of the 32 hosts had 3 IB ports up.  Does Intel MPI or uDAPL use
>multiple ports and/or multiple HCAs?
>
>I shut down all but one port on each host, and now Pallas is running
>better on the 32 nodes using Intel MPI 2.0.1.  HP MPI 2.2 started
>working with Pallas over uDAPL too, so maybe this is a uDAPL issue?
>  
>
Can you tell me what adapters are installed (ibstat), how they are 
configured (ifconfig),  and what your dat.conf looks like? It sounds 
like a device mapping issue during the dat_ia_open() processing.

Multiple ports and HCAs should work fine, but some care is required in 
configuring dat.conf so that you consistently pick up the correct 
device across the cluster. Intel MPI will simply open a device based on 
the provider/device name defined in dat.conf (example: setenv 
I_MPI_DAPL_PROVIDER OpenIB-cma) and query dapl for the address to be 
used for connections. That line in dat.conf determines which library to 
load and which IB device to open and bind to. If you have exactly the 
same configuration on each node and know that ib0, ib1, ib2, etc. will 
always come up in the same order, then you can simply use the same 
netdev names across the cluster and the same copy of dat.conf on each 
node.

Here are the dat.conf options for OpenIB-cma configurations.

# For cma version you specify <ia_params> as:
#       network address, network hostname, or netdev name and 0 for port
#
# Simple (OpenIB-cma) default with netdev name provided first on list
# to enable use of same dat.conf version on all nodes
#
OpenIB-cma u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" ""
OpenIB-cma-ip u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "192.168.0.22 0" ""
OpenIB-cma-name u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "svr1-ib0 0" ""
OpenIB-cma-netdev u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" ""

Which type are you using? address, hostname, or netdev names?
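
If it is not obvious which form a given entry uses, a rough Python 
sketch along these lines could help. It is not part of uDAPL or Intel 
MPI. The field labels, the function names, and the default list of 
netdev names are assumptions made for illustration; only the 
eight-column layout is taken from the entries above.

```python
import ipaddress
import shlex

# Illustrative labels for the eight columns seen in the entries above;
# these names are assumptions, not identifiers from the DAT headers.
FIELDS = ("ia_name", "api_version", "threadsafety", "default",
          "lib_path", "provider_version", "ia_params", "platform_params")


def parse_dat_conf(text):
    """Return one dict per non-comment, non-blank dat.conf line."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # shlex keeps quoted fields such as "ib0 0" and "" as single tokens.
        tokens = shlex.split(line)
        if len(tokens) != len(FIELDS):
            raise ValueError("unexpected field count in line: %r" % line)
        entries.append(dict(zip(FIELDS, tokens)))
    return entries


def classify_ia_params(ia_params, netdev_names=("ib0", "ib1", "ib2")):
    """Guess whether ia_params starts with an address, a netdev, or a hostname.

    netdev_names is a hypothetical list; pass the interface names that
    actually exist on the node being checked.
    """
    target = ia_params.split()[0]
    try:
        ipaddress.ip_address(target)
        return "address"
    except ValueError:
        pass
    if target in netdev_names:
        return "netdev"
    return "hostname"
```

Feeding the contents of dat.conf to parse_dat_conf() and calling 
classify_ia_params() on each entry's "ia_params" value gives a quick 
per-line answer. Anything that is neither an IP address nor a listed 
netdev is reported as a hostname, so the result is only a guess.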

Also, Intel MPI is sometimes too smart for its own good when opening 
rdma devices via uDAPL. If the open fails with the first rdma device 
specified in dat.conf, it will continue on to the next line until one 
succeeds. If all rdma devices fail, it will then fall back to the static 
device automatically. This sometimes does more harm than good, since one 
node could be failing over to the second device in your configuration 
while the other nodes are all on the first device. If they are all on 
the same subnet then it would work fine, but if they are on different 
subnets then we would not be able to connect.
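
To make the failure mode concrete, here is a small Python model of the 
behaviour described above. It is only a sketch under stated assumptions, 
not Intel MPI code: the function names, the node names, the second 
provider name "OpenIB-cma-2", and the subnet mapping are all 
hypothetical.

```python
import ipaddress


def first_usable(providers, can_open):
    """Return the first provider whose open succeeds, else None.

    Models the described behaviour: try each dat.conf entry in order
    and stop at the first success.
    """
    for name in providers:
        if can_open(name):
            return name
    return None


def group_by_subnet(selected, subnet_of):
    """Group nodes by the subnet of the device each one ended up on.

    selected:  {node: provider name chosen on that node}
    subnet_of: {provider name: CIDR string} (hypothetical mapping)
    Returns {network: [nodes]}. More than one key means some nodes
    landed on a different subnet from the rest.
    """
    groups = {}
    for node, provider in sorted(selected.items()):
        net = ipaddress.ip_network(subnet_of[provider])
        groups.setdefault(net, []).append(node)
    return groups
```

If one node fails to open the first entry and lands on the second, 
group_by_subnet() returns two groups when the entries map to different 
subnets, and a single group when they share a subnet, which matches the 
two cases above.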

If you send me your configuration, we can set it up here and hopefully 
duplicate your error case.

-arlin





