[libfabric-users] netdir / fi_endpoint

dshinaberry at MRU.MEDICAL.CANON dshinaberry at MRU.MEDICAL.CANON
Mon Nov 1 11:39:36 PDT 2021


Hello libfabric users,

I am working on bringing up code that will be transferring data via RoCE among Linux and Windows hosts equipped with Mellanox ConnectX-5 NICs.

We have Linux-to-Linux communication working fine with the verbs provider, but I am encountering some issues with the netdir provider on Windows. I am using version 1.13.0 of libfabric at the moment.

I will do my best to limit each thread to a single issue so as not to confuse any follow-on discussion.

So, the first place our code stumbled was a call to fi_endpoint on the subscribing side. It returned FI_EOTHER, which was not very helpful. Digging deeper into the netdir provider layer of the code, I could see that the underlying Network Direct call was returning ND_INSUFFICIENT_RESOURCES.

In searching for a reason, I stumbled across a known issue with the Mellanox NIC<https://docs.mellanox.com/display/winof2v27051000/Known+Issues>:

2690140

Description: Requests of QPs with a string of values set to "max" (e.g., Max Queue Depth + Max SGE counter + Max inline Data size) cannot be processed by the driver as their accumulative size overcomes the WQ maximum size.

Workaround: N/A

Keywords: ND QP creation

Detected in version: 2.70.50000


I was hoping to find that I could control the values that were requested, but that appears not to be the case. The code in question, ofi_nd_ep_control in prov/netdir/src/netdir_ep.c, simply passes in the maximum values it obtained from an earlier query.

When I just blindly cut the values for MaxReceiveQueueDepth and MaxInitiatorQueueDepth by 50%, the routine started returning FI_SUCCESS.
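
For anyone who wants to try the same workaround, here is a minimal sketch of the scaling step, not the actual patch. The struct queue_limits type and both function names are hypothetical stand-ins I made up for illustration; only the two field names come from the provider code mentioned above. It assumes the reported maximums are plain unsigned integers and that the scaled values are then used in place of the maximums when the queue pair is requested.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the limits returned by the adapter query.
 * Only the two field names are taken from the netdir provider code. */
struct queue_limits {
    uint32_t MaxReceiveQueueDepth;
    uint32_t MaxInitiatorQueueDepth;
};

/* Scale a reported maximum to percent/100 of its value.
 * A nonzero maximum never scales down to zero. */
uint32_t scale_depth(uint32_t max_depth, uint32_t percent)
{
    assert(percent >= 1 && percent <= 100);
    uint64_t scaled = (uint64_t)max_depth * percent / 100;
    if (scaled == 0 && max_depth > 0)
        scaled = 1;
    return (uint32_t)scaled;
}

/* Derive the depths to request from the reported maximums. */
struct queue_limits requested_depths(const struct queue_limits *max,
                                     uint32_t percent)
{
    assert(max != NULL);
    struct queue_limits req;
    req.MaxReceiveQueueDepth = scale_depth(max->MaxReceiveQueueDepth, percent);
    req.MaxInitiatorQueueDepth = scale_depth(max->MaxInitiatorQueueDepth, percent);
    return req;
}
```

Calling requested_depths(&max, 50) reproduces the 50% cut described above. I have not tried to find the largest percentage that still succeeds, so 50 is only the value that happened to work here.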

So, for now, I am just leaving that hack in place and building a locally modified version of libfabric in order to proceed.

Since the values being used are maximums, it seems reasonable to expect to be able to query them via fi_getopt and then choose something other than the default via fi_setopt. I am new to libfabric, so perhaps there is already some other mechanism in place to achieve this that I am unaware of.

Thanks for the help,
Derek

Derek Shinaberry
Senior Software Engineer, Platform Software
Canon Medical Research USA, Inc.
706 N. Deerpath Drive, Vernon Hills, IL 60061, USA
www.research.us.medical.canon<http://www.research.us.medical.canon/>

