[ofiwg] definition of a completion in OFI

Wed Mar 4 22:07:51 PST 2015

On Thu, Mar 05, 2015 at 12:59:22AM +0000, Hefty, Sean wrote:

> > For RC the other side perceives this as an asynchronous notification
> > that the QP is destroyed - either because it got the CM MAD, or
> > because it got an error for a data packet. This is not an end of
> > stream mark because it is not synchronous with the data transfer.
> 
> We can't be guaranteed of a kernel agent to do a linger.  Nor can we
> be guaranteed of IB or MADs.  An end of stream marker concept
> doesn't easily work for reliable unconnected.  I care about defining
> the expected behavior of the APIs, not the implementation.

Well, for the un-connected case, it makes sense to me that something
called close or shutdown will complete all outstanding send work to
the point where app resources are no longer needed and then
return. It shouldn't cancel the work, but complete it - whatever that
means for the provider. That is similar to how unconnected sockets
work on close/shutdown.

For anything with 'reliable' in the name, that would have to be blocking
in the worst case. Sockets would never block on close, but their close
does not reap all resources (the kernel side lingers).
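
To make the "complete, don't cancel" idea concrete, here is a minimal
sketch in C. Everything in it is hypothetical: mock_ep,
mock_post_send, mock_progress and mock_close_drain are made-up names
for illustration, not libfabric or verbs calls, and the loop stands in
for whatever blocking wait a real provider would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mock endpoint; not a libfabric or verbs type. */
struct mock_ep {
    size_t pending_sends;   /* sends posted but not yet completed */
    size_t completed_sends; /* sends whose buffers the app may reuse */
    bool   closed;
};

/* Post one send; refused once the endpoint has been closed. */
static bool mock_post_send(struct mock_ep *ep)
{
    if (ep->closed)
        return false;
    ep->pending_sends++;
    return true;
}

/* Stand-in for provider progress: completes at most one pending send. */
static void mock_progress(struct mock_ep *ep)
{
    if (ep->pending_sends > 0) {
        ep->pending_sends--;
        ep->completed_sends++;
    }
}

/* Close that completes, rather than cancels, outstanding sends.
 * Loops (a real provider would block) until nothing is pending,
 * then marks the endpoint closed.
 * Returns the number of sends drained during the close. */
static size_t mock_close_drain(struct mock_ep *ep)
{
    size_t drained = 0;

    while (ep->pending_sends > 0) {
        mock_progress(ep);
        drained++;
    }
    ep->closed = true;
    return drained;
}
```

With three sends posted, mock_close_drain() returns only after all
three have completed, so the app can free its buffers as soon as the
call returns.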

> Whatever is defined for reliable unconnected case seems that it
> should work for UD or RC.  The app needs some level of assurance
> that after it calls fi_blah or receives an FI_BLAH event that all
> data transfers have completed and it's safe to free all resources.

The connected case should be different.

As we've discussed, it isn't really feasible to guarantee delivery
without an end of stream mark, so if the provider can't create end of
stream then it should just cancel all outstanding work on close, and
closing a QP (either side) with entries in the send queue should be
considered an application error.
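
As a rough sketch of that rule in C (again with made-up names:
mock_qp, mock_qp_close and the MOCK_CLOSE_* values are hypothetical,
not part of any real API), close cancels whatever is still queued and
tells the caller it did something wrong:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for the sketch. */
enum mock_close_result {
    MOCK_CLOSE_OK,        /* send queue was empty at close */
    MOCK_CLOSE_APP_ERROR  /* work was still queued and got cancelled */
};

/* Hypothetical mock of a connected QP without end of stream support. */
struct mock_qp {
    size_t sendq_depth; /* entries still in the send queue */
    size_t cancelled;   /* entries cancelled by close */
};

/* Close cancels all outstanding send work. A non-empty send queue at
 * close time is reported as an application error instead of being
 * delivered on a best-effort basis. */
static enum mock_close_result mock_qp_close(struct mock_qp *qp)
{
    if (qp->sendq_depth == 0)
        return MOCK_CLOSE_OK;

    qp->cancelled += qp->sendq_depth;
    qp->sendq_depth = 0;
    return MOCK_CLOSE_APP_ERROR;
}
```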

This is better than 'best effort delivery' because it is more likely
to make the error visible, and then fixed, instead of leaving it as a
hidden race condition.

> I agree, but the developer at least needs to know what calls to
> invoke and events to look for.  They need to know if a call like
> fi_close() is going to block for a minute while CM MADs are retried,
> or if they should call fi_shutdown() and wait for an FI_SHUTDOWN
> event. The app needs to know when it's safe to call either
> function.

Generally speaking, it has to be blocking, and it will be on the order
of a send timeout...

Some scenarios could guarantee non-blocking, and it might be
interesting to expose that.

> The app needs to know if it can receive 2 events for the
> same request...

No, never. Completion means the wrid can be re-used, so signaling an
error with an already completed wrid is something that can never be
correctly handled by the app.

An error from already completed work would have to be reported as an
asynchronous event.
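
A small sketch of that rule in C, assuming a fixed table of wrids.
mock_tracker, mock_post, mock_complete and mock_late_error are
hypothetical names for illustration only. The point is that a wrid
gets exactly one completion, and a late error carries no wrid at all
because the app may already have re-used it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MOCK_MAX_WRID 8

/* Hypothetical tracker for outstanding work request ids. */
struct mock_tracker {
    bool   in_flight[MOCK_MAX_WRID]; /* true while a wrid is outstanding */
    size_t async_errors;             /* late errors, reported without wrid */
};

/* Post work under a wrid; refused if that wrid is still outstanding. */
static bool mock_post(struct mock_tracker *t, size_t wrid)
{
    if (wrid >= MOCK_MAX_WRID || t->in_flight[wrid])
        return false;
    t->in_flight[wrid] = true;
    return true;
}

/* Deliver the single completion for a wrid. After this the wrid may be
 * re-used. A second completion for the same wrid is refused. */
static bool mock_complete(struct mock_tracker *t, size_t wrid)
{
    if (wrid >= MOCK_MAX_WRID || !t->in_flight[wrid])
        return false;
    t->in_flight[wrid] = false;
    return true;
}

/* An error discovered after the work already completed. It cannot name
 * the wrid, so it is counted as an asynchronous event instead. */
static void mock_late_error(struct mock_tracker *t)
{
    t->async_errors++;
}
```

If the late error did name the wrid, the app could not tell whether it
referred to the old request or to new work posted under the same id.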

Jason