[ewg] ib_acme fails for requests with IPv4 addresses (ofed 3.5)

Jens Domke jens.domke at tu-dresden.de
Thu Mar 21 06:53:51 PDT 2013


Hello,

I would like to use IBACM in combination with OpenMPI. Unfortunately, librdmacm does not return the correct information.

We use 2 nodes of a small cluster (18 nodes + 1 switch) for the tests. Both nodes are configured for IPoIB, and ibacm is running on each of them.
If we use the fabric without IPoIB, everything works. With IPoIB configured on the 2 nodes, we are also able to run ssh, ping, rping, traceroute, and ibv_rc_pingpong between the ib0 devices.
Here is the ib0 configuration of the nodes:
=================================================================================
rc002 ~/ $ ifconfig ib0
ib0       Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:02:00:00:00:00:00:00:00:00:00:00:00:00:00
         inet addr:10.0.0.51  Bcast:10.0.0.255  Mask:255.255.255.0
         inet6 addr: fe80::208:f104:399:ebb5/64 Scope:Link
         UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
         RX packets:602 errors:0 dropped:0 overruns:0 frame:0
         TX packets:164 errors:0 dropped:13 overruns:0 carrier:0
         collisions:0 txqueuelen:256
         RX bytes:47360 (46.2 KiB)  TX bytes:25051 (24.4 KiB)

rc003 ~/ $ ifconfig ib0
ib0       Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:02:00:00:00:00:00:00:00:00:00:00:00:00:00
         inet addr:10.0.0.52  Bcast:10.0.0.255  Mask:255.255.255.0
         inet6 addr: fe80::208:f104:399:ecd5/64 Scope:Link
         UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
         RX packets:463 errors:0 dropped:0 overruns:0 frame:0
         TX packets:131 errors:0 dropped:15 overruns:0 carrier:0
         collisions:0 txqueuelen:256
         RX bytes:35971 (35.1 KiB)  TX bytes:20027 (19.5 KiB)
=================================================================================
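
As a quick sanity check of the addressing shown above, the following minimal sketch (using only Python's standard ipaddress module; the helper name same_subnet is made up for illustration) confirms that both ib0 addresses fall into the same /24 network and that the broadcast address matches the ifconfig output. It says nothing about whether ibacm itself knows these addresses.

```python
import ipaddress

# Addresses and netmask copied from the ifconfig output above.
rc002 = ipaddress.ip_interface("10.0.0.51/255.255.255.0")
rc003 = ipaddress.ip_interface("10.0.0.52/255.255.255.0")


def same_subnet(a, b):
    """Return True if both interfaces belong to the same IP network."""
    return a.network == b.network


print("rc002 network:", rc002.network)
print("rc003 network:", rc003.network)
print("broadcast:", rc002.network.broadcast_address)
print("same subnet:", same_subnet(rc002, rc003))
```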


ib_acme works if I use LIDs, GUIDs, or (host)names, but it fails for IPs (with 'Connection timed out').
See the following outputs:
=================================================================================
rc002 ~/ $ ib_acme -f i -s 10.0.0.51 -d 10.0.0.52 -v -P -V
Service: localhost
Destination: 10.0.0.52
Source: 10.0.0.51
ib_acm_resolve_ip failed: Connection timed out
SA verification: failed Cannot assign requested address

Error Count,Resolve Count,No Data,Addr Query Count,Addr Cache Count,Route Query Count,Route Cache Count
localhost,1,2,0,1,0,0,0
return status 0x0

rc002 ~/ $ cat /var/log/ibacm.log
...
1363871789.369: acm_svr_accept:
1363871789.369: acm_svr_accept: assigned client 0
1363871789.369: acm_server: receiving from client 0
1363871789.369: acm_svr_receive: client 0
1363871789.369: acm_svr_resolve_dest: client 0
1363871789.369: acm_svr_resolve_dest: src  10.0.0.51
1363871789.369: acm_get_ep: 10.0.0.51
1363871789.369: acm_svr_resolve_dest: dest 10.0.0.52
1363871789.369: acm_acquire_dest: 10.0.0.52
1363871789.369: acm_get_dest: 10.0.0.52 not found
1363871789.369: acm_alloc_dest: 10.0.0.52
1363871789.369: acm_svr_resolve_dest: sending resolve msg to dest
1363871789.369: acm_send_resolve:
1363871789.369: acm_alloc_send: get dest ff15:4001:ffff:3:400::
1363871789.369: acm_alloc_send: 0x663b50
1363871789.369: acm_init_send_req: 0x663b50
1363871789.369: acm_post_send: posting send to QP
1363871789.369: acm_svr_queue_req: client 0
1363871789.369: acm_alloc_req: client 0, req 0x663d10
1363871789.369: acm_put_dest: 10.0.0.52
1363871789.369: acm_complete_send: waiting for response
1363871789.369: acm_process_recv: base endpoint name rc002
1363871789.369: acm_process_acm_recv:
1363871789.369: acm_process_acm_recv: src  10.0.0.51
1363871789.370: acm_process_acm_recv: dest 10.0.0.52
1363871789.370: acm_process_acm_recv: unsolicited request
1363871789.370: acm_process_addr_req:
1363871789.370: acm_acquire_dest: 10.0.0.51
1363871789.370: acm_get_dest: 10.0.0.51
1363871789.370: acm_process_addr_req: dest state 4
1363871789.370: acm_complete_queued_req: status 0
1363871789.370: acm_put_dest: 10.0.0.51
1363871792.394: acm_process_wait_queue: notice - retrying request
1363871792.394: acm_complete_send: waiting for response
1363871792.394: acm_process_recv: base endpoint name rc002
1363871792.394: acm_process_acm_recv:
1363871792.394: acm_process_acm_recv: src  10.0.0.51
1363871792.394: acm_process_acm_recv: dest 10.0.0.52
1363871792.394: acm_process_acm_recv: unsolicited request
1363871792.394: acm_process_addr_req:
1363871792.394: acm_acquire_dest: 10.0.0.51
1363871792.394: acm_get_dest: 10.0.0.51
1363871792.394: acm_process_addr_req: dest state 4
1363871792.394: acm_complete_queued_req: status 0
1363871792.394: acm_put_dest: 10.0.0.51
1363871795.419: acm_process_wait_queue: notice - retrying request
1363871795.419: acm_complete_send: waiting for response
1363871795.419: acm_process_recv: base endpoint name rc002
1363871795.419: acm_process_acm_recv:
1363871795.419: acm_process_acm_recv: src  10.0.0.51
1363871795.419: acm_process_acm_recv: dest 10.0.0.52
1363871795.419: acm_process_acm_recv: unsolicited request
1363871795.419: acm_process_addr_req:
1363871795.419: acm_acquire_dest: 10.0.0.51
1363871795.419: acm_get_dest: 10.0.0.51
1363871795.419: acm_process_addr_req: dest state 4
1363871795.419: acm_complete_queued_req: status 0
1363871795.419: acm_put_dest: 10.0.0.51
1363871798.444: acm_process_wait_queue: notice - failing request
1363871798.444: acm_process_timeouts: notice - dest 10.0.0.52
1363871798.444: acm_process_addr_resp: resp status 0x6
1363871798.444: acm_complete_queued_req: status 6
1363871798.444: acm_complete_queued_req: completing request, client 0
1363871798.444: acm_client_resolve_resp: client 0, status 0x6
1363871798.444: acm_free_req: 0x663d10
1363871798.444: acm_put_dest: 10.0.0.52
1363871798.444: acm_server: receiving from client 0
1363871798.444: acm_svr_receive: client 0
1363871798.444: acm_svr_query_path: client 0
1363871798.444: acm_get_ep: 9546:1ac:3900:0:4047:4000::
1363871798.444: acm_get_ep: notice - could not find 9546:1ac:3900:0:4047:4000::
1363871798.444: acm_svr_query_path: notice - could not find local end point
1363871798.444: acm_client_query_resp: status 0x7
1363871798.444: acm_server: receiving from client 0
1363871798.444: acm_svr_receive: client 0
1363871798.444: acm_svr_perf_query: client 0
1363871798.445: acm_server: receiving from client 0
1363871798.445: acm_svr_receive: client 0
1363871798.445: acm_svr_receive: client disconnected
=================================================================================
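
The retry pattern in the log above can be made explicit with a small script. This is only a sketch: it assumes the "<timestamp>: <function>: <message>" layout seen in the excerpt, the function names parse and attempt_gaps are made up for illustration, and the sample lines are copied from the rc002 log.

```python
import re
from decimal import Decimal

# A few lines copied from the rc002 log above.
LOG = """\
1363871789.369: acm_svr_resolve_dest: sending resolve msg to dest
1363871789.369: acm_send_resolve:
1363871792.394: acm_process_wait_queue: notice - retrying request
1363871795.419: acm_process_wait_queue: notice - retrying request
1363871798.444: acm_process_wait_queue: notice - failing request
1363871798.444: acm_client_resolve_resp: client 0, status 0x6
"""

# Assumed line layout: "<seconds>.<millis>: <function>: <optional message>"
LINE = re.compile(r"^(\d+\.\d+): (\w+):\s*(.*)$")


def parse(text):
    """Turn log text into (timestamp, function, message) tuples."""
    events = []
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            # Decimal keeps the millisecond arithmetic exact.
            events.append((Decimal(m.group(1)), m.group(2), m.group(3)))
    return events


def attempt_gaps(events):
    """Time between the first send and each retry/fail decision."""
    marks = [
        t
        for t, func, msg in events
        if "sending resolve msg" in msg or func == "acm_process_wait_queue"
    ]
    return [b - a for a, b in zip(marks, marks[1:])]


events = parse(LOG)
gaps = attempt_gaps(events)
for gap in gaps:
    print("gap between attempts:", gap, "s")
```

For the excerpt, each gap is 3.025 s: the request is sent once, retried twice, and then failed with status 0x6, which ib_acme reports as 'Connection timed out'.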


On the second node, the ib_acme command also fails only for IPs, but it returns a different message ('Cannot assign requested address'):
=================================================================================
rc003 ~/tmp/ibacm-1.0.7 $ ib_acme -f i -s 10.0.0.52 -d 10.0.0.51 -v -P -V
Service: localhost
Destination: 10.0.0.51
Source: 10.0.0.52
ib_acm_resolve_ip failed: Cannot assign requested address
SA verification: failed Cannot assign requested address

Error Count,Resolve Count,No Data,Addr Query Count,Addr Cache Count,Route Query Count,Route Cache Count
localhost,1,2,0,0,0,0,0
return status 0x0

rc003 ~/ $ cat /var/log/ibacm.log
...
1363872021.460: acm_svr_accept:
1363872021.460: acm_svr_accept: assigned client 0
1363872021.460: acm_server: receiving from client 0
1363872021.460: acm_svr_receive: client 0
1363872021.460: acm_svr_resolve_dest: client 0
1363872021.460: acm_svr_resolve_dest: src  10.0.0.52
1363872021.460: acm_get_ep: 10.0.0.52
1363872021.460: acm_get_ep: notice - could not find 10.0.0.52
1363872021.460: acm_svr_resolve_dest: notice - unknown local end point
1363872021.460: acm_client_resolve_resp: client 0, status 0x7
1363872021.460: acm_server: receiving from client 0
1363872021.460: acm_svr_receive: client 0
1363872021.460: acm_svr_receive: client disconnected
=================================================================================
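
The "could not find 10.0.0.52" notice suggests that ibacm on rc003 has no local endpoint registered under that IP. One possible explanation (an assumption, not confirmed by the logs) is that the ibacm address file lists the host name but not the IPoIB address. The sketch below checks a file's contents for a given address. The "address device port pkey" layout, the sample entry, and the device name mlx4_0 are all assumptions made for illustration and are not taken from the nodes above.

```python
# Hypothetical address-file contents. The "address device port pkey" layout
# and the device name mlx4_0 are assumptions, not taken from the post.
SAMPLE = """\
# address device port pkey
rc003 mlx4_0 1 default
"""


def listed_addresses(text):
    """Collect the first field of every non-comment, non-blank line."""
    addrs = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if line:
            addrs.add(line.split()[0])
    return addrs


def is_known_endpoint(text, addr):
    """Return True if addr appears as an address entry in the file text."""
    return addr in listed_addresses(text)


print("rc003 listed:    ", is_known_endpoint(SAMPLE, "rc003"))
print("10.0.0.52 listed:", is_known_endpoint(SAMPLE, "10.0.0.52"))
```

With contents like the hypothetical sample, a lookup by name would succeed while a lookup by IP would not, which would match the symptom described here.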


Either I made a mistake in the IPoIB configuration of the nodes (although most of the IPoIB features are usable, as explained above),
or there is a problem in librdmacm or ib_acme.
The OpenSM of the cluster is running on node rc002 (this might explain the difference in the output of ib_acme as shown above).

I'm kind of stuck and have no idea how to investigate the error further.
Hopefully, someone on this list is able to help me.

Thank you in advance,
Jens


--------------------------------
Dipl.-Math. Jens Domke
Research Assistant

Technische Universitaet Dresden
Center for Information Services and High Performance Computing (ZIH)
Interdisciplinary Application Development and Coordination
01062 Dresden
Tel.: +49 (351) 463-39114
Fax: +49 (351) 463-37773
E-Mail: jens.domke at tu-dresden.de
--------------------------------
