[ofiwg] definition of a completion in OFI
Hefty, Sean
sean.hefty at intel.com
Wed Mar 4 13:46:54 PST 2015
I'm seeing a problem running fabtests over the sockets provider that is exposing an issue in what it means for an operation to be complete. As defined, a completion means "that the application's buffers may be re-used". This seems like a minimal definition that would work with any implementation, but it leads to this issue:
1. App 1 issues a send to app 2.
2. Provider 1 queues the send, making use of internal buffering.
3. Provider 1 generates a completion.
4. App 1 exits.
5. Data from app 1 is discarded or lost.
The result is that app 2 hangs waiting for data that never shows up. (This becomes a 2-armies problem.)
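The sequence above can be replayed with a small toy model. This is only a sketch: every name in it (toy_provider, toy_send, toy_progress, toy_exit, toy_scenario) is made up for illustration and is not a libfabric or sockets provider API. It assumes a provider that copies the payload into an internal buffer and reports the completion immediately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: none of these names are real libfabric APIs. */
enum { TOY_BUF_SIZE = 64 };

struct toy_provider {
    char staged[TOY_BUF_SIZE]; /* provider-internal copy of the payload */
    size_t staged_len;         /* 0 means nothing is queued */
    bool completion_ready;     /* a completion was reported to the app */
};

struct toy_receiver {
    char data[TOY_BUF_SIZE];
    size_t len;
};

/* Copy the payload into internal buffering and report completion at once.
 * This satisfies "the application's buffers may be re-used". */
static bool toy_send(struct toy_provider *p, const char *buf, size_t len)
{
    if (len == 0 || len > TOY_BUF_SIZE || p->staged_len != 0)
        return false;
    memcpy(p->staged, buf, len);
    p->staged_len = len;
    p->completion_ready = true;
    return true;
}

/* Progress step: hand the staged data to the peer. */
static void toy_progress(struct toy_provider *p, struct toy_receiver *r)
{
    if (p->staged_len == 0)
        return;
    memcpy(r->data, p->staged, p->staged_len);
    r->len = p->staged_len;
    p->staged_len = 0;
}

/* The app exits: anything still staged is discarded. Returns bytes lost. */
static size_t toy_exit(struct toy_provider *p)
{
    size_t lost = p->staged_len;
    p->staged_len = 0;
    return lost;
}

/* Replay the scenario; returns true if the receiver got the message. */
static bool toy_scenario(bool progress_before_exit)
{
    struct toy_provider p = {0};
    struct toy_receiver r = {0};
    char msg[] = "hello";

    bool ok = toy_send(&p, msg, sizeof msg);
    assert(ok && p.completion_ready); /* completion seen before delivery */
    memset(msg, 0, sizeof msg);       /* legal: the buffer may be re-used */

    if (progress_before_exit)
        toy_progress(&p, &r);
    toy_exit(&p);

    return r.len == sizeof "hello" && memcmp(r.data, "hello", r.len) == 0;
}
```

In this model toy_scenario(false) corresponds to the failing run: app 1 sees its completion and exits, and the receiver never gets the data. toy_scenario(true) only succeeds because the provider happened to make progress before the exit, which the current definition of a completion does not guarantee.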
I see a couple of solutions for this. One is to provide stronger requirements on when a completion can be generated, such as:
Completion: For reliable requests, indicates that the operation and its associated data have been acknowledged by the destination. For unreliable requests, indicates that the request has successfully been transmitted into the fabric.
I'm not sure if all implementations can adhere to this definition. I looked in the iWARP and IB specs, but I couldn't find any specific definition of what it means for an app to retrieve a completion.
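To make the difference between the two definitions concrete, here is a minimal sketch of each as a predicate over the state of a single operation. The enums and function names are hypothetical and exist only for this illustration; they are not part of any OFI interface.

```c
#include <stdbool.h>

/* Hypothetical names for illustration; not part of any OFI interface. */
enum toy_reliability { TOY_RELIABLE, TOY_UNRELIABLE };

enum toy_op_state {
    TOY_POSTED,      /* provider still reads from the app's buffer */
    TOY_BUFFERED,    /* payload copied into provider-internal buffering */
    TOY_TRANSMITTED, /* payload handed to the fabric */
    TOY_ACKED        /* destination acknowledged the data (reliable only) */
};

/* Current definition: the application's buffers may be re-used. */
static bool toy_may_complete_weak(enum toy_op_state s)
{
    return s >= TOY_BUFFERED;
}

/* Proposed stronger definition: acked for reliable requests,
 * transmitted into the fabric for unreliable ones. */
static bool toy_may_complete_strong(enum toy_reliability r,
                                    enum toy_op_state s)
{
    if (r == TOY_RELIABLE)
        return s == TOY_ACKED;
    return s >= TOY_TRANSMITTED;
}
```

The gap between the two is the window in which an operation is buffered or transmitted but not yet acked: the weak rule allows a completion there, the strong rule does not for reliable requests.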
A second solution is to add the behavior defined above to another call or event, such as fi_shutdown. For example:
fi_shutdown() - For reliable endpoints, blocks until all operations and their associated data have been acked by the destination. For unreliable endpoints, indicates that all requests have successfully been transmitted into the fabric.
In this case, calling fi_close() without fi_shutdown() will abruptly close the endpoint. There may need to be other constraints on fi_shutdown(), such as requiring that the app ensure all requests have completed before calling it.
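A rough sketch of how the shutdown/close split might behave, using a toy endpoint rather than real libfabric types. All names here (toy_ep, toy_ep_progress, toy_ep_shutdown, toy_ep_close) are assumptions made for illustration, and the model assumes the peer always acks eventually, so the blocking loop terminates.

```c
#include <assert.h>
#include <stddef.h>

/* Toy endpoint; these are not libfabric types or calls. */
enum toy_ep_type { TOY_EP_RELIABLE, TOY_EP_UNRELIABLE };

struct toy_ep {
    enum toy_ep_type type;
    size_t queued;    /* buffered by the provider, not yet transmitted */
    size_t in_flight; /* transmitted, waiting for an ack (reliable only) */
    size_t done;      /* acked (reliable) or transmitted (unreliable) */
};

/* One progress step: transmit one queued op, else collect one ack.
 * Assumes the peer always acks. */
static void toy_ep_progress(struct toy_ep *ep)
{
    if (ep->queued > 0) {
        ep->queued--;
        if (ep->type == TOY_EP_RELIABLE)
            ep->in_flight++;
        else
            ep->done++;
    } else if (ep->in_flight > 0) {
        ep->in_flight--;
        ep->done++;
    }
}

/* Proposed shutdown semantics: block until nothing is queued or unacked.
 * Returns the number of progress steps taken. */
static size_t toy_ep_shutdown(struct toy_ep *ep)
{
    size_t steps = 0;
    while (ep->queued > 0 || ep->in_flight > 0) {
        toy_ep_progress(ep);
        steps++;
    }
    return steps;
}

/* Abrupt close: drop whatever is still pending. Returns ops dropped. */
static size_t toy_ep_close(struct toy_ep *ep)
{
    size_t dropped = ep->queued + ep->in_flight;
    ep->queued = 0;
    ep->in_flight = 0;
    return dropped;
}
```

With three sends queued on a reliable endpoint, calling toy_ep_close() directly drops all three, while calling toy_ep_shutdown() first leaves nothing for the close to drop. A real implementation would also need a timeout or error path for a peer that never acks, which this sketch leaves out.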
Thoughts?
- Sean