[ewg] OFED-1.5.1 failure over iWarp
Hefty, Sean
sean.hefty at intel.com
Fri Jan 29 14:49:44 PST 2010
>>>> I am getting the following error when trying to run Intel MPI
>>>> over nes iwarp cards on today's daily build of OFED-1.5.1.
>>>> OFED-1.5 does not show this problem.
Thanks to Steve's help, it looks like the problem is a result of combining these two patches:
core_0390_rdma_cm_fix_loopback_address.patch
This is an upstream fix for loopback support running over IB that was added to 2.6.33.
core_0450_ib_core-RoCEE-CMA-device-binding.patch
This modifies the rdma_cm path for RoCEE. It includes this change:
 static inline void rdma_addr_get_sgid(struct rdma_dev_addr *dev_addr, union ib_gid *gid)
 {
-	memcpy(gid, dev_addr->src_dev_addr + rdma_addr_gid_offset(dev_addr), sizeof *gid);
+	if (dev_addr->transport == RDMA_TRANSPORT_IB &&
+	    dev_addr->dev_type != ARPHRD_INFINIBAND)
+		rocee_addr_get_sgid(dev_addr, gid);
+	else
+		memcpy(gid, dev_addr->src_dev_addr +
+		       rdma_addr_gid_offset(dev_addr), sizeof *gid);
 }
Apparently dev_addr->transport has not been set when rdma_addr_get_sgid() is called. The memory was zalloc'ed so transport is set to 0, which happens to be RDMA_TRANSPORT_IB. The result is that rocee_addr_get_sgid() is called for iwarp devices.
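To illustrate the failure mode described above, here is a minimal user-space sketch, not the kernel code. The `mock_*` types and helper names are hypothetical stand-ins; the sketch assumes the IB transport enumerator is 0 (as stated above) and that the iWARP NIC reports an Ethernet device type.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel definitions (not the real headers). */
enum mock_transport { MOCK_TRANSPORT_IB, MOCK_TRANSPORT_IWARP }; /* IB == 0 */
#define MOCK_ARPHRD_ETHER      1
#define MOCK_ARPHRD_INFINIBAND 32

struct mock_dev_addr {
	unsigned short dev_type;
	enum mock_transport transport;
};

/* Mirrors the condition added by the patch: returns 1 for the RoCEE path. */
static int takes_rocee_path(const struct mock_dev_addr *a)
{
	return a->transport == MOCK_TRANSPORT_IB &&
	       a->dev_type != MOCK_ARPHRD_INFINIBAND;
}

/*
 * Build a zeroed address for an Ethernet-based iWARP device (assumption)
 * and report which path the check selects. When set_transport is 0 the
 * transport field is left at its zeroed value, as in the failing case.
 */
static int iwarp_takes_rocee_path(int set_transport)
{
	struct mock_dev_addr a;

	memset(&a, 0, sizeof a);                  /* mimics a zeroed allocation */
	assert(a.transport == MOCK_TRANSPORT_IB); /* zero aliases the IB value */

	a.dev_type = MOCK_ARPHRD_ETHER;
	if (set_transport)
		a.transport = MOCK_TRANSPORT_IWARP;

	return takes_rocee_path(&a);
}
```

In this sketch, iwarp_takes_rocee_path(0) returns 1 (the uninitialized transport is mistaken for IB), while iwarp_takes_rocee_path(1) returns 0 once the transport is set explicitly.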
dev_addr->transport needs to be initialized somewhere, but I'm not sure where. I also think the if statement above should be changed to:
	if (dev_addr->transport == RDMA_TRANSPORT_IB &&
	    dev_addr->dev_type == ARPHRD_ETHER)
(or whatever ARPHRD_* is for rocee.) Just the change to the if statement may be sufficient to re-enable iwarp.
- Sean