[ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes

Tang, Changqing changquing.tang at hp.com
Thu Jul 5 19:10:48 PDT 2007


 
> >	However, if we configure both ib0 and ib1 on the same network
> >(172.200.0.x, 255.255.255.0), uDAPL works if all ranks use ib0, but
> >fails if all ranks use ib1 with error code:
> >	DAT_CONNECTION_EVENT_NON_PEER_REJECTED 0x4003 (after
> >dat_connect() and dat_evd_wait())
> >
> >We get the same error if some ranks use ib0 and some ranks use ib1.
> >
> 
> What does your /etc/dat.conf look like? What is the listening
> port on each interface, and what address/port are you using
> for each connection?

/etc/dat.conf is the default file after installation:

OpenIB-cma u1.2 nonthreadsafe default /usr/ofed/lib64/libdaplcma.so dapl.1.2 "ib0 0" ""
OpenIB-cma-1 u1.2 nonthreadsafe default /usr/ofed/lib64/libdaplcma.so dapl.1.2 "ib1 0" ""
OpenIB-cma-2 u1.2 nonthreadsafe default /usr/ofed/lib64/libdaplcma.so dapl.1.2 "ib2 0" ""
OpenIB-cma-3 u1.2 nonthreadsafe default /usr/ofed/lib64/libdaplcma.so dapl.1.2 "ib3 0" ""
OpenIB-bond u1.2 nonthreadsafe default /usr/ofed/lib64/libdaplcma.so dapl.1.2 "bond0 0" ""
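
(For reference, each rank picks one of these provider names when it opens
its IA; the sketch below is illustrative only, not our actual code, and
just shows opening the ib1 provider and querying the local address that
the ranks later exchange:)

/* Illustrative sketch: open a specific uDAPL provider from dat.conf and
 * query its local IA address.  "OpenIB-cma-1" maps to ib1 per the file
 * above; error handling is reduced to bare checks.
 * Build (assumption): gcc open_ia.c -ldat
 */
#include <stdio.h>
#include <dat/udat.h>

int main(void)
{
    DAT_EVD_HANDLE async_evd = DAT_HANDLE_NULL;
    DAT_IA_HANDLE  ia;
    DAT_IA_ATTR    ia_attr;
    DAT_RETURN     ret;

    ret = dat_ia_open("OpenIB-cma-1", 8, &async_evd, &ia);
    if (ret != DAT_SUCCESS) {
        fprintf(stderr, "dat_ia_open: 0x%x\n", ret);
        return 1;
    }

    /* ia_attr.ia_address_ptr is the address handed to the other ranks */
    ret = dat_ia_query(ia, &async_evd, DAT_IA_FIELD_ALL, &ia_attr, 0, NULL);
    if (ret != DAT_SUCCESS) {
        fprintf(stderr, "dat_ia_query: 0x%x\n", ret);
        return 1;
    }

    dat_ia_close(ia, DAT_CLOSE_GRACEFUL_FLAG);
    return 0;
}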

However, we have configured only ib0 and ib1:

mpixbl05:/nis.home/ctang:/sbin/ifconfig

ib0       Link encap:InfiniBand  HWaddr 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:172.200.0.5  Bcast:172.200.0.255  Mask:255.255.255.0
          inet6 addr: fe80::219:bbff:fff7:ace5/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
          RX packets:2118 errors:0 dropped:0 overruns:0 frame:0
          TX packets:84 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:128 
          RX bytes:217135 (212.0 KiB)  TX bytes:10854 (10.5 KiB)

ib1       Link encap:InfiniBand  HWaddr 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:172.200.0.11  Bcast:172.200.0.255  Mask:255.255.255.0
          inet6 addr: fe80::219:bbff:fff7:6ba9/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
          RX packets:2090 errors:0 dropped:0 overruns:0 frame:0
          TX packets:57 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:128 
          RX bytes:215361 (210.3 KiB)  TX bytes:9072 (8.8 KiB)

The listening port (conn_qual) is 1024 for the first rank, which uses the
first card (ib0), and 1025 for the second rank, which uses the second card
(ib1). The address is the one returned in "ia_attr->ia_address_ptr".

Even though I force all ranks to use only the first card (ib0), it works
for a while and then fails with NON_PEER_REJECTED when one rank tries to
connect to another rank (dat_connect() and dat_evd_wait()). (I run a
simple MPI job in an infinite loop; it fails after hundreds of runs.)
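
(The failure shows up in the check right after dat_connect(); roughly like
the following sketch, which is simplified and not our exact code:)

#include <stdio.h>
#include <dat/udat.h>

/* Block on the connect EVD after dat_ep_connect() and classify the
 * resulting connection event.  Simplified: timeouts and the other
 * connection events are not handled here. */
int wait_for_established(DAT_EVD_HANDLE conn_evd)
{
    DAT_EVENT  event;
    DAT_COUNT  nmore;
    DAT_RETURN ret;

    ret = dat_evd_wait(conn_evd, DAT_TIMEOUT_INFINITE, 1, &event, &nmore);
    if (ret != DAT_SUCCESS)
        return -1;

    switch (event.event_number) {
    case DAT_CONNECTION_EVENT_ESTABLISHED:
        return 0;                           /* the normal case        */
    case DAT_CONNECTION_EVENT_NON_PEER_REJECTED:
        fprintf(stderr, "connect rejected (0x%x)\n",
                (unsigned)event.event_number);
        return -1;                          /* the failure we observe */
    default:
        fprintf(stderr, "unexpected event 0x%x\n",
                (unsigned)event.event_number);
        return -1;
    }
}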


> 
> Also, can you run ucmatose to verify rdma_cma is working 
> correctly across each interface?

It works on the first card (ib0) but fails on the second card (ib1).

On mpixbl05, ib0 is "inet addr:172.200.0.5  Bcast:172.200.0.255  Mask:255.255.255.0"
and ib1 is "inet addr:172.200.0.11  Bcast:172.200.0.255  Mask:255.255.255.0".
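
(For reference, my understanding is that "ucmatose -b <addr>" binds the
listening rdma_cm_id to that local IP and "-s <addr>" is the server
address the client resolves and connects to.  In librdmacm terms the
server side is roughly the fragment below; this is illustrative, not the
actual cmatose source, and the port number is arbitrary.)

/* Bind an rdma_cm_id to a specific IPoIB address (ib1 on mpixbl05 here)
 * and listen on it, which is essentially what "-b 172.200.0.11" selects.
 * Error handling trimmed; build (assumption): gcc bind_ib1.c -lrdmacm */
#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <rdma/rdma_cma.h>

int main(void)
{
    struct rdma_event_channel *ch;
    struct rdma_cm_id *id;
    struct sockaddr_in addr;

    ch = rdma_create_event_channel();
    if (!ch || rdma_create_id(ch, &id, NULL, RDMA_PS_TCP)) {
        perror("rdma_create_id");
        return 1;
    }

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(7471);                        /* arbitrary */
    inet_pton(AF_INET, "172.200.0.11", &addr.sin_addr); /* ib1 address */

    /* Binding to this IP selects the RDMA device/port behind ib1. */
    if (rdma_bind_addr(id, (struct sockaddr *)&addr) || rdma_listen(id, 1)) {
        perror("rdma_bind_addr/rdma_listen");
        return 1;
    }

    printf("listening on 172.200.0.11\n");
    /* ... an rdma_get_cm_event() loop would follow here ... */
    rdma_destroy_id(id);
    rdma_destroy_event_channel(ch);
    return 0;
}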

From mpixbl06, I can ping both IPs:

mpixbl06:/net/mpixbl06/lscratch/ctang/test:ping 172.200.0.11 
PING 172.200.0.11 (172.200.0.11) 56(84) bytes of data.
64 bytes from 172.200.0.11: icmp_seq=1 ttl=64 time=3.50 ms
64 bytes from 172.200.0.11: icmp_seq=2 ttl=64 time=0.034 ms
64 bytes from 172.200.0.11: icmp_seq=3 ttl=64 time=0.029 ms

--- 172.200.0.11 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2001ms
rtt min/avg/max/mdev = 0.029/1.189/3.504/1.636 ms
mpixbl06:/net/mpixbl06/lscratch/ctang/test:


mpixbl06:/net/mpixbl06/lscratch/ctang/test:ping 172.200.0.5 
PING 172.200.0.5 (172.200.0.5) 56(84) bytes of data.
64 bytes from 172.200.0.5: icmp_seq=1 ttl=64 time=0.772 ms
64 bytes from 172.200.0.5: icmp_seq=2 ttl=64 time=0.038 ms
64 bytes from 172.200.0.5: icmp_seq=3 ttl=64 time=0.030 ms

--- 172.200.0.5 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.030/0.280/0.772/0.347 ms
mpixbl06:/net/mpixbl06/lscratch/ctang/test:

But ucmatose works only on ib0:

mpixbl05:/nis.home/ctang:ucmatose -b 172.200.0.5
cmatose: starting server
initiating data transfers
completing sends
receiving data transfers
data transfers complete
cmatose: disconnecting
disconnected
test complete
return status 0
mpixbl05:/nis.home/ctang:

mpixbl06:/lscratch/ctang/mpi2251:ucmatose -s 172.200.0.5
cmatose: starting client
cmatose: connecting
receiving data transfers
sending replies
data transfers complete
test complete
return status 0
mpixbl06:/lscratch/ctang/mpi2251:

It fails on ib1:

mpixbl05:/net/mpixbl06/lscratch/ctang/test:ucmatose -b 172.200.0.11
cmatose: starting server


mpixbl06:/net/mpixbl06/lscratch/ctang/test:ucmatose -s 172.200.0.11
cmatose: starting client
cmatose: connecting
cmatose: event: 8, error: 0
receiving data transfers
sending replies
data transfers complete
test complete
return status 0
mpixbl06:/net/mpixbl06/lscratch/ctang/test:
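
(One note on the "event: 8" line above: cmatose prints the raw rdma_cm
event number.  If I read the rdma_cm_event_type enum in <rdma/rdma_cma.h>
correctly, 8 is RDMA_CM_EVENT_REJECTED; worth double-checking against the
header shipped with OFED 1.2.  A tiny decoder, for what it's worth:)

/* Map the numeric event printed by cmatose back to its enum name.
 * Relies only on the rdma_cm_event_type values in <rdma/rdma_cma.h>;
 * verify against the installed header. */
#include <stdio.h>
#include <rdma/rdma_cma.h>

static const char *cm_event_name(int ev)
{
    switch (ev) {
    case RDMA_CM_EVENT_ADDR_RESOLVED:    return "ADDR_RESOLVED";
    case RDMA_CM_EVENT_ADDR_ERROR:       return "ADDR_ERROR";
    case RDMA_CM_EVENT_ROUTE_RESOLVED:   return "ROUTE_RESOLVED";
    case RDMA_CM_EVENT_ROUTE_ERROR:      return "ROUTE_ERROR";
    case RDMA_CM_EVENT_CONNECT_REQUEST:  return "CONNECT_REQUEST";
    case RDMA_CM_EVENT_CONNECT_RESPONSE: return "CONNECT_RESPONSE";
    case RDMA_CM_EVENT_CONNECT_ERROR:    return "CONNECT_ERROR";
    case RDMA_CM_EVENT_UNREACHABLE:      return "UNREACHABLE";
    case RDMA_CM_EVENT_REJECTED:         return "REJECTED";
    case RDMA_CM_EVENT_ESTABLISHED:      return "ESTABLISHED";
    case RDMA_CM_EVENT_DISCONNECTED:     return "DISCONNECTED";
    default:                             return "UNKNOWN";
    }
}

int main(void)
{
    printf("event 8 = RDMA_CM_EVENT_%s\n", cm_event_name(8));
    return 0;
}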


--CQ


> 
> For example:
> 
> start a server on both interfaces (I am assuming 172.200.0.1 and
> 172.200.0.2)
> 
> ucmatose -b 172.200.0.1
> ucmatose -b 172.200.0.2
> 
> start a client on each interface on the other system
> 
> ucmatose -s 172.200.0.1
> ucmatose -s 172.200.0.2
> 
> Thanks,
> 
> -arlin
> 


