[ofiwg] definition of a completion in OFI
sean.hefty at intel.com
Wed Mar 4 22:49:09 PST 2015
> The problem is not with the definition of completion at the initiator, but
> with the definition (or implementation) of the target. And, in your
> example, I suspect the target doesn't really "hang", but more like "hangs
> until the TCP connection times out". If you don't like the length of the
> time-out, I'm pretty sure the provider could set an option for how long it
> will wait ;-)
The socket provider doesn't maintain a 1:1 pairing between an OFI endpoint and a TCP socket, even in the connected case. As a result, the test actually hangs waiting to read a completion for a posted receive. You are correct that the problem is seen at the target. The cause is that the initiator thinks it's done and exits, leaking random bits into the abyss.
This isn't some enterprise-worthy application that handles failover or abnormal termination. The app simply sends a message from the client to the server and back. That's it. And under normal circumstances, with normal exit behavior, it doesn't work. This fails for both the reliable-connected and reliable-unconnected pingpong tests.
If the apps need to do something different, then we should at least define what that should be. If what the tests are doing is fine, then the provider needs to do something different. And that behavior should likewise be captured somewhere. Maybe changes are needed in both, as Jason is implying.
OFI defines both local and remote completion concepts. At this point, I think everyone agrees that this is a problem in the shutdown/close semantics and their implementation, and not in the completion semantics.
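
To make the "what should the app or provider do" question a bit more concrete, here is a rough sketch of one possible orderly-shutdown pattern. It uses plain Python TCP sockets as an analogy only; it is not the OFI API and not how the socket provider is implemented. The names pingpong, server, and recv_exact are made up for illustration. The idea it shows: the side that sends the last message does not close until it has seen the peer's end-of-stream, so a local "send done" is never treated as permission to exit.

```python
import socket
import threading


def recv_exact(sock, n):
    """Read exactly n bytes, or raise if the peer closes first."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed before all data arrived")
        buf += chunk
    return buf


def server(listener, result):
    """Target side: echo the ping, then wait for the peer's EOF before closing."""
    conn, _ = listener.accept()
    with conn:
        msg = recv_exact(conn, 4)
        conn.sendall(msg)  # locally "complete", but not yet known to be delivered
        # Do not exit yet: block until the client signals it is finished.
        result["saw_eof"] = conn.recv(1) == b""
        result["msg"] = msg


def pingpong():
    """Run one ping/pong exchange over loopback with an orderly shutdown."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    result = {}
    t = threading.Thread(target=server, args=(listener, result))
    t.start()

    with socket.create_connection(("127.0.0.1", port)) as c:
        c.sendall(b"ping")
        reply = recv_exact(c, 4)
        c.shutdown(socket.SHUT_WR)      # tell the server no more data is coming
        drained = c.recv(1) == b""      # wait for the server to close its side

    t.join()
    listener.close()
    return reply, drained, result


if __name__ == "__main__":
    print(pingpong())
```

Whether this kind of handshake belongs in the test apps, in the provider's close path, or in both is exactly the open question; the sketch only illustrates the behavior that would need to be written down.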