[libfabric-users] Linux verbs / Windows netdir Interoperability

dshinaberry at MRU.MEDICAL.CANON
Fri Nov 5 11:05:20 PDT 2021

Hello list users,

After getting our use case working from Windows to Windows, I started on Linux/Windows interoperability, with Linux (verbs) as the sender and Windows (netdir) as the receiver. The two endpoints connected once I accounted for the fact that the verbs library silently adds and removes a byte at the start of the CM data, while the netdir provider does not.
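
To make the CM data adjustment concrete, here is a minimal sketch of the kind of compensation that could be applied on the netdir side. The helper names are hypothetical and are not part of libfabric. It also assumes the extra byte holds the payload length, which is only a guess; the one thing observed above is that a single byte is added and removed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers for illustration only; not libfabric API. */

/*
 * Prepend one header byte to CM data that will be sent to a verbs peer,
 * so that after the peer strips that byte it sees exactly `len` bytes.
 * ASSUMPTION: the header byte carries the payload length.
 * Returns 0 on success, -1 if the payload or output buffer does not fit.
 */
static int cm_wrap_for_verbs(const void *payload, size_t len,
                             uint8_t *out, size_t out_cap, size_t *out_len)
{
    if (len > UINT8_MAX || out_cap < len + 1)
        return -1;
    out[0] = (uint8_t)len;
    if (len > 0)
        memcpy(out + 1, payload, len);
    *out_len = len + 1;
    return 0;
}

/*
 * Skip the leading byte of CM data received from a verbs peer.
 * On success, *payload points into `in` just past that byte.
 * Returns 0 on success, -1 if the input is too short to hold the byte.
 */
static int cm_unwrap_from_verbs(const uint8_t *in, size_t in_len,
                                const uint8_t **payload, size_t *len)
{
    if (in_len < 1)
        return -1;
    *payload = in + 1;
    *len = in_len - 1;
    return 0;
}
```

The netdir side would call the unwrap helper on CM data arriving from the verbs peer, and the wrap helper on any CM data it sends back.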

However, the fi_sendv from Linux failed on the netdir side with an ND_DATA_OVERRUN error status. After digging into the netdir provider code a bit, I think I see the source of the problem. The receive buffer provided in the call to fi_recvv does not appear to be passed to the ND2 implementation immediately; instead, it is queued inside the netdir provider. The provider seems to expect a small incoming message indicating that a large message transfer is being requested. Only when it processes this small initiation message does it take the queued receive buffer and pass it to the ND2 implementation, after which the large message is transferred and processed. The verbs sender presumably sends the large message directly, with no such initiation message, which would explain the overrun.
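
The behavior described above can be pictured with a toy model. This is not netdir provider code: every name and size below is made up for illustration, and the model only encodes the hypothesis that the user's buffer stays queued until a small initiation message has been processed.

```c
#include <stddef.h>

/* Toy status codes; TOY_DATA_OVERRUN stands in for ND_DATA_OVERRUN. */
enum toy_status { TOY_OK = 0, TOY_DATA_OVERRUN = 1 };

/* Hypothetical receiver state, modeling the suspected provider behavior. */
struct toy_receiver {
    size_t ctrl_buf_len;  /* small buffer that is actually posted up front */
    size_t user_buf_len;  /* buffer from fi_recvv, queued in the provider */
    int user_buf_posted;  /* nonzero once the queued buffer has been posted */
};

/* A small initiation message arrives from a peer that speaks the
 * provider's own protocol: the queued user buffer is posted in response. */
static enum toy_status toy_on_initiation(struct toy_receiver *r, size_t msg_len)
{
    if (msg_len > r->ctrl_buf_len)
        return TOY_DATA_OVERRUN;
    r->user_buf_posted = 1;
    return TOY_OK;
}

/* A payload arrives: it lands in whichever buffer is currently posted. */
static enum toy_status toy_on_payload(const struct toy_receiver *r,
                                      size_t payload_len)
{
    size_t posted = r->user_buf_posted ? r->user_buf_len : r->ctrl_buf_len;
    return payload_len <= posted ? TOY_OK : TOY_DATA_OVERRUN;
}
```

In this model, a peer that sends the large payload without the initiation message hits only the small posted buffer and gets an overrun, while a peer that sends the initiation message first succeeds.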

So it appears that the netdir provider will need changes in order to support our desired use case. My manager is on board with our making that effort and contributing the code back upstream. Our concern is that, since none of us is particularly experienced with libfabric or Network Direct, we don't know whether some hard limitation in the Network Direct API would make interoperability between Linux verbs and Windows netdir impossible.

We're wondering whether anyone here knows why the netdir provider was designed the way it is. Even better, we would like to know whether the original implementers might be available to consult on their design choices and on the limitations we would face in changing the provider to support our use case.

Many Thanks,

Derek Shinaberry
Senior Software Engineer, Platform Software
Canon Medical Research USA, Inc.
706 N. Deerpath Drive, Vernon Hills, IL 60061, USA
