[libfabric-users] what does -FI_EALREADY mean?

Hefty, Sean sean.hefty at intel.com
Mon Jun 29 13:02:49 PDT 2020


> In running a test case at moderately large scale (64 nodes, 128 tx endpoints per node)
> on a Cray CS system with libfabric 1.10.1 and the verbs;ofi_rxm provider, we saw a
> -FI_EALREADY ("Operation already in progress") return value from a fi_write() call.  Can
> anyone out there give me more information as to what that error code might indicate is
> going wrong?  The man pages don't really contain anything except that error text.

Searching through the code, I only see FI_EALREADY in a few places, all of which should be internal error handling only.  For example, RXM uses it to detect that a connection is already in progress, but I don't see a path where that error code can be returned to the user.  Similarly, verbs has a couple of assertions that FI_EALREADY isn't returned as an error when inserting items into rbtrees.  In a release (non-debug) build, where those assertions are compiled out, that value could be returned back to the user.
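
To illustrate the assert case, here is a minimal sketch of the pattern, not the actual verbs code.  All of the names (demo_set, demo_set_insert, demo_track_connection) are hypothetical, a small array stands in for the rbtree, and the POSIX EALREADY value stands in for FI_EALREADY.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_MAX_KEYS 16	/* hypothetical capacity, for illustration only */

/* Hypothetical stand-in for an rbtree of connection keys. */
struct demo_set {
	int keys[DEMO_MAX_KEYS];
	size_t count;
};

/* Insert a key.  Returns 0 on success, -EALREADY if the key is already
 * present, or -ENOMEM if the set is full. */
static int demo_set_insert(struct demo_set *set, int key)
{
	for (size_t i = 0; i < set->count; i++) {
		if (set->keys[i] == key)
			return -EALREADY;
	}
	if (set->count == DEMO_MAX_KEYS)
		return -ENOMEM;

	set->keys[set->count++] = key;
	return 0;
}

/* Caller that assumes a duplicate insert cannot happen.  With assertions
 * enabled, a duplicate aborts here.  With NDEBUG defined, the assert
 * expands to nothing and -EALREADY is returned to whoever called us. */
static int demo_track_connection(struct demo_set *set, int conn_id)
{
	int ret = demo_set_insert(set, conn_id);

	assert(ret != -EALREADY);
	return ret;
}
```

Calling demo_track_connection() twice with the same conn_id aborts at the assert in a build with assertions enabled, but quietly returns -EALREADY when compiled with -DNDEBUG.  That difference is why a debug build should show whether one of these paths is being hit.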

It's possible this is coming from lower level code (e.g. verbs), but I'm skeptical of that.

Can you run with a debug build to see if you're going through one of the assert paths?  Do you know if you're using XRC for the underlying transport?   The verbs FI_EALREADY asserts are in XRC code paths.

- Sean

