[libfabric-users] what does -FI_EALREADY mean?
Hefty, Sean
sean.hefty at intel.com
Mon Jun 29 13:02:49 PDT 2020
> In running a test case at moderately large scale (64 nodes, 128 tx endpoints per node)
> on a Cray CS system with libfabric 1.10.1 and the verbs;ofi_rxm provider, we saw a
> -FI_EALREADY ("Operation already in progress") return value from a fi_write() call. Can
> anyone out there give me more information as to what that error code might indicate is
> going wrong? The man pages don't really contain anything except that error text.
Searching through the code, I only see FI_EALREADY in a few places, all of which should be internal error handling only. For example, RXM uses it to detect that a connection is already in progress, but I don't see a path where that error code can be returned to the user. Similarly, verbs has a couple of assertions that FI_EALREADY isn't returned as an error when inserting items into rbtrees. A free (non-debug) build, where those assertions are compiled out, could return that value back to the user.
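To illustrate the assertion pattern in question, here is a minimal C sketch. It does not use libfabric at all: sketch_insert() and sketch_post() are hypothetical names standing in for an internal insert helper and a user-facing call, and a small array stands in for the rbtree. It is only meant to show how a return value that is guarded by assert() in a debug build can reach the caller once assertions are compiled out with -DNDEBUG.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for an internal keyed container (not the verbs rbtree). */
#define SKETCH_MAX_KEYS 16
static int sketch_keys[SKETCH_MAX_KEYS];
static size_t sketch_nkeys;

/* Insert a key; returns 0 on success, -EALREADY if the key is already present. */
static int sketch_insert(int key)
{
    for (size_t i = 0; i < sketch_nkeys; i++) {
        if (sketch_keys[i] == key)
            return -EALREADY;
    }
    if (sketch_nkeys == SKETCH_MAX_KEYS)
        return -ENOMEM;
    sketch_keys[sketch_nkeys++] = key;
    return 0;
}

/* Hypothetical user-facing operation that treats a duplicate as "cannot happen". */
static int sketch_post(int key)
{
    int ret = sketch_insert(key);

    /* Debug build: a duplicate aborts here.
     * Build with -DNDEBUG: the check vanishes and -EALREADY is returned as-is. */
    assert(ret != -EALREADY);
    return ret;
}
```

Calling sketch_post() twice with the same key aborts at the assertion in a debug build, while a build with -DNDEBUG hands -EALREADY back to the caller, which is the kind of behavior a non-debug build could show.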
It's possible this is coming from lower level code (e.g. verbs), but I'm skeptical of that.
Can you run with a debug build to see if you're going through one of the assert paths? Do you know if you're using XRC for the underlying transport? The verbs FI_EALREADY asserts are in XRC code paths.
- Sean