[ofa-general] [GIT PULL] OFED 1.2 uDAPL release notes
Tang, Changqing
changquing.tang at hp.com
Thu Jul 5 19:10:48 PDT 2007
> > However, if we configure both ib0 and ib1 on the same network
> > (172.200.0.x, 255.255.255.0), uDAPL works if all ranks use ib0,
> > but fails if all ranks use ib1 with error code:
> > DAT_CONNECTION_EVENT_NON_PEER_REJECTED 0x4003 (after
> > dat_connect() and dat_evd_wait())
> >
> > The same error appears if some ranks use ib0 and some ranks use ib1.
> >
>
> What does your /etc/dat.conf look like? What is the listening
> port on each interface and what address/port are you using
> for each connection?
/etc/dat.conf is the default file after installation:

OpenIB-cma u1.2 nonthreadsafe default /usr/ofed/lib64/libdaplcma.so dapl.1.2 "ib0 0" ""
OpenIB-cma-1 u1.2 nonthreadsafe default /usr/ofed/lib64/libdaplcma.so dapl.1.2 "ib1 0" ""
OpenIB-cma-2 u1.2 nonthreadsafe default /usr/ofed/lib64/libdaplcma.so dapl.1.2 "ib2 0" ""
OpenIB-cma-3 u1.2 nonthreadsafe default /usr/ofed/lib64/libdaplcma.so dapl.1.2 "ib3 0" ""
OpenIB-bond u1.2 nonthreadsafe default /usr/ofed/lib64/libdaplcma.so dapl.1.2 "bond0 0" ""
However, of these five entries, only ib0 and ib1 are actually configured:
mpixbl05:/nis.home/ctang:/sbin/ifconfig
ib0       Link encap:InfiniBand  HWaddr 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:172.200.0.5  Bcast:172.200.0.255  Mask:255.255.255.0
          inet6 addr: fe80::219:bbff:fff7:ace5/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
          RX packets:2118 errors:0 dropped:0 overruns:0 frame:0
          TX packets:84 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:217135 (212.0 KiB)  TX bytes:10854 (10.5 KiB)

ib1       Link encap:InfiniBand  HWaddr 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:172.200.0.11  Bcast:172.200.0.255  Mask:255.255.255.0
          inet6 addr: fe80::219:bbff:fff7:6ba9/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
          RX packets:2090 errors:0 dropped:0 overruns:0 frame:0
          TX packets:57 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:215361 (210.3 KiB)  TX bytes:9072 (8.8 KiB)
The listening port (conn_qual) is 1024 for the first rank, which uses the
first card (ib0), and 1025 for the second rank, which uses the second card
(ib1). The address is the "ia_attr->ia_address_ptr".

Even though I force all ranks to use only the first card (ib0), it works for
a while and then fails with NON_PEER_REJECTED when one rank tries to connect
to another rank (dat_connect() and dat_evd_wait()). (I run a simple MPI job
in an infinite loop; it fails after hundreds of runs.)
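For reference, the failing step corresponds roughly to the following sketch.
The helper name connect_and_wait() and its parameters are illustrative; the
endpoint, the peer's IA address, and the connection EVD are assumed to have
been created earlier with dat_ep_create() and dat_evd_create().

#include <stdio.h>
#include <dat/udat.h>

/* Sketch of the dat_ep_connect()/dat_evd_wait() step described above. */
int connect_and_wait(DAT_EP_HANDLE ep,
                     DAT_IA_ADDRESS_PTR remote_addr,  /* peer ia_address_ptr */
                     DAT_CONN_QUAL conn_qual,         /* e.g. 1024 or 1025   */
                     DAT_EVD_HANDLE conn_evd)
{
    DAT_EVENT event;
    DAT_COUNT nmore;
    DAT_RETURN ret;

    ret = dat_ep_connect(ep, remote_addr, conn_qual, DAT_TIMEOUT_INFINITE,
                         0, NULL, DAT_QOS_BEST_EFFORT,
                         DAT_CONNECT_DEFAULT_FLAG);
    if (ret != DAT_SUCCESS)
        return -1;

    ret = dat_evd_wait(conn_evd, DAT_TIMEOUT_INFINITE, 1, &event, &nmore);
    if (ret != DAT_SUCCESS)
        return -1;

    switch (event.event_number) {
    case DAT_CONNECTION_EVENT_ESTABLISHED:
        return 0;
    case DAT_CONNECTION_EVENT_NON_PEER_REJECTED:  /* 0x4003, the error seen */
        fprintf(stderr, "rejected by something other than the peer ULP\n");
        return -1;
    default:
        fprintf(stderr, "unexpected event 0x%x\n",
                (unsigned) event.event_number);
        return -1;
    }
}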
>
> Also, can you run ucmatose to verify rdma_cma is working
> correctly across each interface?
It works on the first card (ib0) and fails on the second card (ib1).
On mpixbl05, ib0 is "inet addr:172.200.0.5  Bcast:172.200.0.255  Mask:255.255.255.0"
and ib1 is "inet addr:172.200.0.11  Bcast:172.200.0.255  Mask:255.255.255.0".
From mpixbl06, I can ping both IPs:
mpixbl06:/net/mpixbl06/lscratch/ctang/test:ping 172.200.0.11
PING 172.200.0.11 (172.200.0.11) 56(84) bytes of data.
64 bytes from 172.200.0.11: icmp_seq=1 ttl=64 time=3.50 ms
64 bytes from 172.200.0.11: icmp_seq=2 ttl=64 time=0.034 ms
64 bytes from 172.200.0.11: icmp_seq=3 ttl=64 time=0.029 ms
--- 172.200.0.11 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2001ms
rtt min/avg/max/mdev = 0.029/1.189/3.504/1.636 ms
mpixbl06:/net/mpixbl06/lscratch/ctang/test:
mpixbl06:/net/mpixbl06/lscratch/ctang/test:ping 172.200.0.5
PING 172.200.0.5 (172.200.0.5) 56(84) bytes of data.
64 bytes from 172.200.0.5: icmp_seq=1 ttl=64 time=0.772 ms
64 bytes from 172.200.0.5: icmp_seq=2 ttl=64 time=0.038 ms
64 bytes from 172.200.0.5: icmp_seq=3 ttl=64 time=0.030 ms
--- 172.200.0.5 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.030/0.280/0.772/0.347 ms
mpixbl06:/net/mpixbl06/lscratch/ctang/test:
ucmatose works on ib0:
mpixbl05:/nis.home/ctang:ucmatose -b 172.200.0.5
cmatose: starting server
initiating data transfers
completing sends
receiving data transfers
data transfers complete
cmatose: disconnecting
disconnected
test complete
return status 0
mpixbl05:/nis.home/ctang:
mpixbl06:/lscratch/ctang/mpi2251:ucmatose -s 172.200.0.5
cmatose: starting client
cmatose: connecting
receiving data transfers
sending replies
data transfers complete
test complete
return status 0
mpixbl06:/lscratch/ctang/mpi2251:
It fails on ib1:
mpixbl05:/net/mpixbl06/lscratch/ctang/test:ucmatose -b 172.200.0.11
cmatose: starting server
mpixbl06:/net/mpixbl06/lscratch/ctang/test:ucmatose -s 172.200.0.11
cmatose: starting client
cmatose: connecting
cmatose: event: 8, error: 0
receiving data transfers
sending replies
data transfers complete
test complete
return status 0
mpixbl06:/net/mpixbl06/lscratch/ctang/test:
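(In the librdmacm headers from this period, event number 8 appears to
correspond to RDMA_CM_EVENT_REJECTED, which would match the NON_PEER_REJECTED
reported by uDAPL.) For reference, a minimal sketch of the address-resolution
path that ucmatose exercises, with an explicit source address so the client
side can be pinned to ib1; the function name resolve_via() is illustrative,
and route resolution, connection setup, and the full event loop are trimmed:

#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <rdma/rdma_cma.h>

static int resolve_via(const char *src_ip, const char *dst_ip)
{
    struct rdma_event_channel *ch;
    struct rdma_cm_id *id;
    struct rdma_cm_event *event;
    struct sockaddr_in src, dst;

    memset(&src, 0, sizeof src);
    src.sin_family = AF_INET;
    inet_pton(AF_INET, src_ip, &src.sin_addr);   /* e.g. local ib1 address  */

    memset(&dst, 0, sizeof dst);
    dst.sin_family = AF_INET;
    inet_pton(AF_INET, dst_ip, &dst.sin_addr);   /* e.g. remote ib1 address */

    ch = rdma_create_event_channel();
    if (!ch)
        return -1;
    if (rdma_create_id(ch, &id, NULL, RDMA_PS_TCP))
        return -1;

    /* A non-NULL source address binds the cm_id to a specific interface. */
    if (rdma_resolve_addr(id, (struct sockaddr *) &src,
                          (struct sockaddr *) &dst, 2000))
        return -1;
    if (rdma_get_cm_event(ch, &event))
        return -1;
    printf("event %d, status %d\n", event->event, event->status);
    rdma_ack_cm_event(event);

    /* rdma_resolve_route(), rdma_connect(), etc. would follow here. */
    return 0;
}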
--CQ
>
> For example:
>
> start a server on both interfaces (I am assuming 172.200.0.1 and
> 172.200.0.2)
>
> ucmatose -b 172.200.0.1
> ucmatose -b 172.200.0.2
>
> start a client on each interface on the other system
>
> ucmatose -s 172.200.0.1
> ucmatose -s 172.200.0.2
>
> Thanks,
>
> -arlin
>