[ofiwg] definition of a completion in OFI
grun at cray.com
Wed Mar 4 14:15:39 PST 2015
In the case of IB, the completion behavior is exactly as you describe below as a matter of specification for both reliable and unreliable services.
To me, if the provider accepts the data for transmission, it should not generate a completion until it has in fact completed the operation. For a reliable service, that means all of its acks have come back; for an unreliable service, it means that all data has been put on the wire.
I believe that the behavior you are describing below is characteristic of sockets, and I imagine that a sockets application expects this kind of shortcut and can deal with it. But in our case sockets is simply another provider that lives below the OFI API. I suggest that we require the behavior that is exposed to the application to be consistent (i.e. no completion until the data is actually transmitted) and therefore force the sockets provider to deal with the fact that it somehow has to actually transfer the data before it signals a completion.
From: ofiwg-bounces at lists.openfabrics.org [mailto:ofiwg-bounces at lists.openfabrics.org] On Behalf Of Hefty, Sean
Sent: Wednesday, March 04, 2015 1:47 PM
To: ofiwg at lists.openfabrics.org
Subject: [ofiwg] definition of a completion in OFI
I'm seeing a problem running fabtests over the sockets provider that is exposing an issue in what it means for an operation to be complete. As defined, a completion means "that the application's buffers may be re-used". This seems like a minimal definition that would work with any implementation, but it leads to this issue:
1. App 1 issues a send to app 2.
2. Provider 1 queues the send, making use of internal buffering.
3. Provider 1 generates a completion.
4. App 1 exits.
5. Data from app 1 is discarded or lost.
The result is app 2 hangs waiting for data that never shows up. (This becomes a 2-armies problem.)
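The sequence above can be modeled with a small, self-contained C sketch. This is a toy simulation, not libfabric code: `toy_provider`, `toy_send`, `toy_progress`, `toy_exit`, and `run_scenario` are hypothetical names invented for illustration. It assumes a provider that copies each send into an internal buffer and reports a completion either at that point or only once the data has reached the peer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model; none of these names are libfabric APIs. */
enum completion_policy {
    COMPLETE_ON_BUFFER,   /* completion means "app buffer may be reused" */
    COMPLETE_ON_DELIVERY  /* completion means "data reached the peer" */
};

struct toy_provider {
    enum completion_policy policy;
    size_t queued;       /* sends held in internal buffers, not yet sent */
    size_t delivered;    /* sends that reached the peer */
    size_t completions;  /* completions reported to the app */
};

/* The app posts a send; the provider copies it into an internal buffer. */
static void toy_send(struct toy_provider *p)
{
    p->queued++;
    if (p->policy == COMPLETE_ON_BUFFER)
        p->completions++;  /* completion is generated before any transfer */
}

/* One unit of provider progress: push one queued send to the peer. */
static bool toy_progress(struct toy_provider *p)
{
    if (p->queued == 0)
        return false;
    p->queued--;
    p->delivered++;
    if (p->policy == COMPLETE_ON_DELIVERY)
        p->completions++;
    return true;
}

/* The app exits; anything still queued is discarded. Returns sends lost. */
static size_t toy_exit(struct toy_provider *p)
{
    size_t lost = p->queued;
    p->queued = 0;
    return lost;
}

/* App 1: send once, wait for the completion, then exit. */
static size_t run_scenario(enum completion_policy policy)
{
    struct toy_provider p = { .policy = policy };

    toy_send(&p);
    while (p.completions == 0) {
        bool progressed = toy_progress(&p);
        assert(progressed);  /* a pending send must be able to progress */
        (void)progressed;
    }
    return toy_exit(&p);
}
```

In this model, `run_scenario(COMPLETE_ON_BUFFER)` loses the message because the app sees its completion and exits before the provider makes any progress, while `run_scenario(COMPLETE_ON_DELIVERY)` loses nothing.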
I see a couple of solutions for this. One is to provide stronger requirements on when a completion can be generated, such as:
Completion: For reliable requests, indicates that the operation and its associated data have been acknowledged by the destination. For unreliable requests, indicates that the request has been successfully transmitted into the fabric.
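As a rough illustration, the stronger definition reduces to a predicate that a provider would check before writing a completion. The sketch below is only a reading of the proposed wording; `toy_service`, `toy_op`, and `toy_may_complete` are hypothetical names and do not correspond to any existing OFI interface.

```c
#include <stdbool.h>

/* Hypothetical types for illustration only; not part of any OFI header. */
enum toy_service { TOY_RELIABLE, TOY_UNRELIABLE };

struct toy_op {
    enum toy_service service;
    bool transmitted;  /* all data has been placed on the fabric */
    bool acked;        /* destination acknowledged the operation and data */
};

/* Returns true when the proposed definition allows a completion. */
static bool toy_may_complete(const struct toy_op *op)
{
    if (op->service == TOY_RELIABLE)
        return op->transmitted && op->acked;  /* wait for the ack */
    return op->transmitted;                   /* on the wire is enough */
}
```

Under this rule, a reliable operation that has been transmitted but not yet acknowledged does not complete, which is exactly the case a buffering provider would otherwise report early.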
I'm not sure whether all implementations can adhere to this definition. I looked in the iWARP and IB specs, but I couldn't find any specific definition of what it means for an app to retrieve a completion.
A second solution is to add the behavior defined above to another call or event, such as fi_shutdown. For example:
fi_shutdown() - For reliable endpoints, blocks until all operations and their associated data have been acked by the destination. For unreliable endpoints, indicates that all requests have successfully been transmitted into the fabric.
In this case, calling fi_close() without fi_shutdown() will abruptly close the endpoint. There may need to be other constraints for fi_shutdown(), such as requiring the app to ensure that all requests have completed before calling it.
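The difference between the two calls in this second proposal can be sketched with a toy reliable endpoint. The names `toy_ep`, `toy_ep_progress`, `toy_ep_shutdown`, and `toy_ep_close` are hypothetical and are not the libfabric signatures; the sketch assumes each progress step acknowledges one outstanding send.

```c
#include <stddef.h>

/* Hypothetical toy endpoint; not a libfabric structure. */
struct toy_ep {
    size_t unacked;    /* reliable sends not yet acked by the destination */
    size_t delivered;  /* sends the destination has acknowledged */
    size_t dropped;    /* sends discarded by an abrupt close */
};

/* One progress step: the destination acks one outstanding send. */
static void toy_ep_progress(struct toy_ep *ep)
{
    if (ep->unacked > 0) {
        ep->unacked--;
        ep->delivered++;
    }
}

/* Proposed shutdown semantics: block until nothing is outstanding. */
static void toy_ep_shutdown(struct toy_ep *ep)
{
    while (ep->unacked > 0)
        toy_ep_progress(ep);
}

/* Abrupt close: whatever is still outstanding is discarded. */
static void toy_ep_close(struct toy_ep *ep)
{
    ep->dropped += ep->unacked;
    ep->unacked = 0;
}
```

Calling `toy_ep_close()` directly on an endpoint with three unacknowledged sends drops all three, while calling `toy_ep_shutdown()` first leaves nothing for the close to drop.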