[libfabric-users] will RxM's EAGAIN handling work properly if I specify FI_INJECT_COMPLETE?

Hefty, Sean sean.hefty at intel.com
Tue Sep 1 14:40:13 PDT 2020


> The man page for the RxM utility provider says that because it establishes connections
> on demand as data transfers are done, one must be prepared to ride out one or more
> -FI_EAGAIN error returns from those initial data transfer calls.  Does this imply that
> one must also request at least FI_TRANSMIT_COMPLETE completion on those calls?  The
> fi_msg man page seems to say that -FI_EAGAIN can result from a lack of resources on
> either the local or remote side, but if I specify FI_INJECT_COMPLETE how could the data
> transfer function know about a lack of resources on the remote node if it returns
> before the operation has even reached there?

No - EAGAIN is an immediate failure.  It indicates that the operation was never accepted by the provider.  Completion semantics do not matter in that case, since no completion will be generated.

Regarding the documentation, it simply does not restrict which providers may return EAGAIN or for what reason.  For example, the shared memory provider can return EAGAIN because of a lack of resources at the peer.  Additionally, it's possible that flow control could prevent queuing transfers to a specific peer.  However, since that limitation is ultimately due to a lack of resources at the peer, other operations targeting a different peer may still be accepted.
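
To illustrate the points above, here is a minimal sketch of a retry loop, not taken from any provider's code.  The names post_fn, progress_fn, post_with_retry and the stub_* helpers are hypothetical and exist only for this example; the stub stands in for a provider that rejects the first few attempts.  In a real application the post callback would presumably wrap a call such as fi_send or fi_inject, and the progress callback would poll the completion queue (for example with fi_cq_read) so the provider can make progress between attempts.  The sketch uses EAGAIN from errno.h on the assumption that libfabric's FI_EAGAIN is defined to the same value.

```c
#include <errno.h>      /* EAGAIN; assumed equal to libfabric's FI_EAGAIN */
#include <stddef.h>
#include <sys/types.h>  /* ssize_t */

/* Hypothetical callback types standing in for a data transfer call
 * and for whatever drives provider progress. */
typedef ssize_t (*post_fn)(void *ctx);
typedef void (*progress_fn)(void *ctx);

/* Retry a post until the provider accepts it, a hard error occurs,
 * or max_tries attempts have been made.  The pending-completion
 * counter is bumped only on acceptance: a -EAGAIN return means the
 * operation was never queued, so no completion will ever arrive
 * for it, whatever completion level was requested. */
static ssize_t post_with_retry(post_fn post, progress_fn progress,
                               void *ctx, int max_tries, int *pending)
{
    ssize_t ret = -EAGAIN;

    for (int i = 0; i < max_tries; i++) {
        ret = post(ctx);
        if (ret != -EAGAIN)
            break;          /* accepted (0) or a hard error */
        progress(ctx);      /* let the provider make progress, then retry */
    }
    if (ret == 0)
        (*pending)++;
    return ret;
}

/* Stub provider for illustration: rejects the first rejects_left posts. */
struct stub_provider {
    int rejects_left;
    int progress_calls;
    int accepted;
};

static ssize_t stub_post(void *ctx)
{
    struct stub_provider *p = ctx;

    if (p->rejects_left > 0) {
        p->rejects_left--;
        return -EAGAIN;     /* operation not accepted, nothing queued */
    }
    p->accepted++;
    return 0;
}

static void stub_progress(void *ctx)
{
    struct stub_provider *p = ctx;

    p->progress_calls++;
}
```

Because the retry is per operation, an application talking to several peers could also move on to a transfer targeting a different peer instead of spinning on the one that returned EAGAIN.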

- Sean

