[ewg] OFED-1.5.1 failure over iWarp

Hefty, Sean sean.hefty at intel.com
Fri Jan 29 14:49:44 PST 2010


>>>> I am getting the following error when trying to run Intel MPI
>>>> over nes iwarp cards on today's daily build of OFED-1.5.1.
>>>> OFED-1.5 does not show this problem.

Thanks to Steve's help, it looks like the problem is a result of combining these two patches:

core_0390_rdma_cm_fix_loopback_address.patch
This is an upstream fix for loopback support running over IB that was added to 2.6.33.

core_0450_ib_core-RoCEE-CMA-device-binding.patch
This modifies the rdma_cm path for RoCEE.  It includes this change:

static inline void rdma_addr_get_sgid(struct rdma_dev_addr *dev_addr, union ib_gid *gid)
 {
-       memcpy(gid, dev_addr->src_dev_addr + rdma_addr_gid_offset(dev_addr), sizeof *gid);
+       if (dev_addr->transport == RDMA_TRANSPORT_IB &&
+           dev_addr->dev_type != ARPHRD_INFINIBAND)
+               rocee_addr_get_sgid(dev_addr, gid);
+       else
+               memcpy(gid, dev_addr->src_dev_addr +
+                      rdma_addr_gid_offset(dev_addr), sizeof *gid);
 }

Apparently dev_addr->transport has not been set when rdma_addr_get_sgid() is called.  The memory was zalloc'ed so transport is set to 0, which happens to be RDMA_TRANSPORT_IB.  The result is that rocee_addr_get_sgid() is called for iwarp devices.

dev_addr->transport needs to be initialized somewhere, but I'm not sure where.  I also think the if statement above should be changed to:

if (dev_addr->transport == RDMA_TRANSPORT_IB &&
    dev_addr->dev_type == ARPHRD_ETHER)

(or whatever ARPHRD_* is for rocee.)  Just the change to the if statement may be sufficient to re-enable iwarp.
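
To make the failure mode concrete, here is a minimal userspace sketch of the two checks. It is not kernel code: the struct, the enum names with the SKETCH_ prefix, and the helper functions are all hypothetical stand-ins, and it assumes the iwarp device reports an Ethernet dev_type. The only things taken from the discussion above are that the memory is zeroed and that a transport of 0 reads as IB.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical local stand-ins, not the kernel definitions.
 * The IB value is 0, matching the behaviour described above. */
enum sketch_transport { SKETCH_TRANSPORT_IB = 0, SKETCH_TRANSPORT_IWARP = 1 };
enum sketch_dev_type { SKETCH_ARPHRD_ETHER = 1, SKETCH_ARPHRD_INFINIBAND = 32 };

struct sketch_dev_addr {
    enum sketch_transport transport;
    int dev_type;
};

/* Condition as written in the RoCEE patch: IB transport, non-IB link type. */
static int takes_rocee_path_patch(const struct sketch_dev_addr *a)
{
    return a->transport == SKETCH_TRANSPORT_IB &&
           a->dev_type != SKETCH_ARPHRD_INFINIBAND;
}

/* Condition as proposed above: IB transport, Ethernet link type. */
static int takes_rocee_path_proposed(const struct sketch_dev_addr *a)
{
    return a->transport == SKETCH_TRANSPORT_IB &&
           a->dev_type == SKETCH_ARPHRD_ETHER;
}

/* Address for an iwarp NIC whose transport field was never written.
 * Assumption: the NIC shows up as an Ethernet device. */
static struct sketch_dev_addr zeroed_iwarp_addr(void)
{
    struct sketch_dev_addr a;
    memset(&a, 0, sizeof a);          /* mimics the zalloc'ed memory */
    a.dev_type = SKETCH_ARPHRD_ETHER; /* transport stays 0, i.e. "IB" */
    return a;
}

/* Walks through the cases discussed in the message. */
static void sketch_demo(void)
{
    struct sketch_dev_addr a = zeroed_iwarp_addr();

    /* Uninitialized transport: the patched check picks the RoCEE path. */
    assert(takes_rocee_path_patch(&a));

    /* Once transport is set, both checks leave iwarp on the memcpy path. */
    a.transport = SKETCH_TRANSPORT_IWARP;
    assert(!takes_rocee_path_patch(&a));
    assert(!takes_rocee_path_proposed(&a));
}
```

Under the Ethernet assumption in this sketch, the stricter check on its own would still match a zeroed address, so whether it is sufficient depends on what dev_type the iwarp device actually reports. Initializing transport covers both forms of the check.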

- Sean


