[ewg] OFED-1.5.1 failure over iWarp

Steve Wise swise at opengridcomputing.com
Mon Feb 1 09:17:51 PST 2010


Sean, are you going to push the changes? If not, who is?

Thanks,

Steve.


Hefty, Sean wrote:
>>>>> I am getting the following error when trying to run Intel MPI
>>>>> over nes iwarp cards on today's daily build of OFED-1.5.1.
>>>>> OFED-1.5 does not show this problem.
>>>>>           
>
> Thanks to Steve's help, it looks like the problem is a result of combining these two patches:
>
> core_0390_rdma_cm_fix_loopback_address.patch
> This is an upstream fix for loopback support running over IB that was added to 2.6.33.
>
> core_0450_ib_core-RoCEE-CMA-device-binding.patch
> This modifies the rdma_cm path for RoCEE.  It includes this change:
>
> static inline void rdma_addr_get_sgid(struct rdma_dev_addr *dev_addr, union ib_gid *gid)
>  {
> -       memcpy(gid, dev_addr->src_dev_addr + rdma_addr_gid_offset(dev_addr), sizeof *gid);
> +       if (dev_addr->transport == RDMA_TRANSPORT_IB &&
> +           dev_addr->dev_type != ARPHRD_INFINIBAND)
> +               rocee_addr_get_sgid(dev_addr, gid);
> +       else
> +               memcpy(gid, dev_addr->src_dev_addr +
> +                      rdma_addr_gid_offset(dev_addr), sizeof *gid);
>  }
>
> Apparently dev_addr->transport has not been set when rdma_addr_get_sgid() is called.  The memory was zalloc'ed so transport is set to 0, which happens to be RDMA_TRANSPORT_IB.  The result is that rocee_addr_get_sgid() is called for iwarp devices.
>
> dev_addr->transport needs to be initialized somewhere, but I'm not sure where.  I also think the if statement above should be changed to:
>
> if (dev_addr->transport == RDMA_TRANSPORT_IB &&
>     dev_addr->dev_type == ARPHRD_ETHER)
>
> (or whatever ARPHRD_* is for rocee.)  Just the change to the if statement may be sufficient to re-enable iwarp.
>
> - Sean
>   
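
For illustration only, here is a minimal user-space sketch of the failure mode
Sean describes: a zero-filled address structure whose transport field was never
assigned compares equal to RDMA_TRANSPORT_IB, so the patched condition selects
the RoCEE path for any device type other than ARPHRD_INFINIBAND. The struct
mock_dev_addr and both helper functions are hypothetical stand-ins, not the
kernel definitions, and calloc() stands in for the kernel's zeroing allocator.
The enum order and ARPHRD_* values are assumed to match the kernel headers.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed to mirror the kernel: IB is the first enumerator, so it is 0. */
enum rdma_transport_type { RDMA_TRANSPORT_IB, RDMA_TRANSPORT_IWARP };

/* Assumed to match the values in linux/if_arp.h. */
#define ARPHRD_ETHER      1
#define ARPHRD_INFINIBAND 32

/* Hypothetical, reduced stand-in for struct rdma_dev_addr. */
struct mock_dev_addr {
    enum rdma_transport_type transport;
    unsigned short dev_type;
};

/* Evaluates the same condition that the RoCEE patch adds to
 * rdma_addr_get_sgid(): nonzero means the RoCEE branch is taken. */
static int takes_rocee_path(const struct mock_dev_addr *a)
{
    assert(a != NULL);
    return a->transport == RDMA_TRANSPORT_IB &&
           a->dev_type != ARPHRD_INFINIBAND;
}

/* Zero-allocates an address, sets only dev_type (transport is left at
 * its zeroed value), and reports which branch the condition picks.
 * Returns -1 if the allocation fails. */
static int unset_transport_takes_rocee_path(unsigned short dev_type)
{
    struct mock_dev_addr *a = calloc(1, sizeof *a);
    int result;

    if (!a)
        return -1;
    a->dev_type = dev_type;
    result = takes_rocee_path(a);
    free(a);
    return result;
}
```

In this sketch, assigning RDMA_TRANSPORT_IWARP to the transport field before
the check is enough to make takes_rocee_path() return 0 for an Ethernet device
type. Where the real code should perform that assignment is the open question
raised above.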



