[ewg] OFED-1.5.1 failure over iWarp
Steve Wise
swise at opengridcomputing.com
Mon Feb 1 09:17:51 PST 2010
Sean, are you going to push the changes? If not, who is?
Thanks,
Steve.
Hefty, Sean wrote:
>>>>> I am getting the following error when trying to run Intel MPI
>>>>> over nes iwarp cards on today's daily build of OFED-1.5.1.
>>>>> OFED-1.5 does not show this problem.
>>>>>
>
> Thanks to Steve's help, it looks like the problem is a result of combining these two patches:
>
> core_0390_rdma_cm_fix_loopback_address.patch
> This is an upstream fix for loopback support running over IB that was added to 2.6.33.
>
> core_0450_ib_core-RoCEE-CMA-device-binding.patch
> This modifies the rdma_cm path for RoCEE. It includes this change:
>
>  static inline void rdma_addr_get_sgid(struct rdma_dev_addr *dev_addr, union ib_gid *gid)
>  {
> -        memcpy(gid, dev_addr->src_dev_addr + rdma_addr_gid_offset(dev_addr), sizeof *gid);
> +        if (dev_addr->transport == RDMA_TRANSPORT_IB &&
> +            dev_addr->dev_type != ARPHRD_INFINIBAND)
> +                rocee_addr_get_sgid(dev_addr, gid);
> +        else
> +                memcpy(gid, dev_addr->src_dev_addr +
> +                       rdma_addr_gid_offset(dev_addr), sizeof *gid);
>  }
>
> Apparently dev_addr->transport has not been set when rdma_addr_get_sgid() is called. The memory was zalloc'ed so transport is set to 0, which happens to be RDMA_TRANSPORT_IB. The result is that rocee_addr_get_sgid() is called for iwarp devices.
>
> dev_addr->transport needs to be initialized somewhere, but I'm not sure where. I also think the if statement above should be changed to:
>
> if (dev_addr->transport == RDMA_TRANSPORT_IB &&
> dev_addr->dev_type == ARPHRD_ETHER)
>
> (or whatever ARPHRD_* is for rocee.) Just the change to the if statement may be sufficient to re-enable iwarp.
>
> - Sean
>