[ofa-general] troubleshooting IB_CM_REJ_INVALID_SERVICE_ID in RDMA_CM_EVENT_REJECTED at active side of the connection

Isaac Huang He.Huang at Sun.COM
Wed Feb 4 20:47:28 PST 2009


Hi,

I got some RDMA_CM_EVENT_REJECTED events on the active side (i.e. the
nodes calling rdma_connect), after RDMA_CM_EVENT_ADDR_RESOLVED and
RDMA_CM_EVENT_ROUTE_RESOLVED had both been received.

From reading the CM code, the reject reason
(IB_CM_REJ_INVALID_SERVICE_ID) means that the passive side could not
find a listener with the requested service_id on the device on which
the connection request arrived.
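
For reference, here is a minimal sketch of how I understand the RDMA CM
to derive the service_id from the port space and the listener's port,
based on my reading of cma_get_service_id in the kernel's cma.c. The
helper name rdma_cm_service_id and the example port 5000 are made up
for illustration, and the value is shown in host byte order; please
correct me if the layout differs in older OFED releases.

```python
# Sketch only: mirrors my reading of cma_get_service_id(), i.e.
# service_id = (port_space << 16) + port. Not taken from any OFED tool.

RDMA_PS_TCP = 0x0106  # port space used by rdma_create_id(..., RDMA_PS_TCP)
RDMA_PS_UDP = 0x0111


def rdma_cm_service_id(port, port_space=RDMA_PS_TCP):
    """Hypothetical helper: return the 64-bit service ID (host byte order)."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port must fit in 16 bits")
    return (port_space << 16) + port


if __name__ == "__main__":
    # 5000 is an arbitrary example port, not the one our code listens on.
    print(f"{rdma_cm_service_id(5000):#018x}")
```

If that is right, a mismatch in either the port or the port space
between the two sides would produce a service_id the passive side has
no listener for, independent of which device the request arrived on.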

I suspect that either the active side or the passive side was bound
to the wrong IB device; both sides have multiple IB interfaces on the
fabric. Our code does bind to the correct local IP addresses on both
sides: src_addr in rdma_resolve_addr on the active side, and
rdma_bind_addr before rdma_listen on the passive side. However, I seem
to remember that some old OFED versions had issues in
rdma_translate_ip that could cause the wrong interface to be returned,
e.g. bugs 726 and 325. Also, the active side was running OFED 1.3.1,
and the passive side may have been running an older version.

Could you give me some tips for troubleshooting? Are there any
debugging options or /proc files to look at? Is there a netstat-like
tool in OFED (e.g. something like "netstat -ltp" to find out who is
listening on which device)?

The other possible cause could be ARP flux, but unfortunately arping
over IPoIB always segfaults on our systems. Is there any other way to
troubleshoot possible ARP flux issues?

BTW, pinging over IPoIB addresses worked fine.

Any suggestions would be greatly appreciated.

Thanks,
Isaac


