[libfabric-users] Two-stage completion

Fri Sep 16 10:09:44 PDT 2016

> I'm not sure that using fi_tsendmsg with the FI_INJECT flag would meet
> Jonathan's requirements - if I'm reading this email chain correctly.
> I'll reorder his requirements and add comments:
> 
> >	4. The application doesn't perform extra copy operations on the
> message unless it's completely unavoidable.
> 
> You also don't want the *provider* to do any internal memory copies. In
> my mind, using FI_INJECT means "release ownership of the source buffer
> and return from this function ASAP". Maybe the provider doesn't have
> hardware support for this and needs to memcpy, or maybe the hardware is
> busy and the provider needs to memcpy. Also, maybe the provider doesn't
> support a large enough "inject size" transmit attribute - the
> application would have to memcpy. Or maybe the provider implements an
> arbitrarily large "inject size" using an internal memcpy?
> 
> >	1. Blocking send semantics over unconnected endpoints
> 
> This is a blocking send from the perspective of his wrapper library,
> right? The library implementation would be to spin on fi_cq_read()
> until an FI_INJECT_COMPLETE event and then return to the application.
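A minimal sketch of that wrapper loop, under stated assumptions: `struct mock_cq`, `mock_cq_read()`, `mock_post_send()` and `blocking_send()` are hypothetical stand-ins so the example runs without a provider. `mock_cq_read()` only mimics the return convention of fi_cq_read() (a positive count, or a negative "try again" code while the queue is empty). A real wrapper would call fi_cq_read() on its CQ, keep looping on -FI_EAGAIN, and fetch error details with fi_cq_readerr() when it returns -FI_EAVAIL.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical completion entry: carries the context given at post time. */
struct mock_cq_entry {
    void *op_context;
};

/* Hypothetical completion queue holding at most one outstanding send.
 * 'delay' models how many polls the provider needs before the requested
 * completion level (e.g. inject-complete) is reached. */
struct mock_cq {
    void *pending;   /* context of the posted, not yet completed send */
    int delay;       /* polls remaining until the completion is visible */
    int polls;       /* total number of read calls, for illustration */
};

/* Mimics the shape of fi_cq_read(): 1 entry read, or -EAGAIN if empty. */
static long mock_cq_read(struct mock_cq *cq, struct mock_cq_entry *out)
{
    cq->polls++;
    if (cq->pending == NULL || cq->delay-- > 0)
        return -EAGAIN;
    out->op_context = cq->pending;
    cq->pending = NULL;
    return 1;
}

/* Hypothetical post: records a send whose completion appears after
 * 'delay' unsuccessful polls. Only one send may be outstanding. */
static int mock_post_send(struct mock_cq *cq, void *context, int delay)
{
    if (cq->pending != NULL)
        return -EAGAIN;
    cq->pending = context;
    cq->delay = delay;
    return 0;
}

/* Blocking send from the wrapper library's point of view: post, then spin
 * on the CQ until the completion arrives. Returns 0 on success. */
static int blocking_send(struct mock_cq *cq, void *context, int delay)
{
    struct mock_cq_entry entry;
    long ret;

    ret = mock_post_send(cq, context, delay);
    if (ret)
        return (int)ret;
    do {
        ret = mock_cq_read(cq, &entry);
    } while (ret == -EAGAIN);
    /* One outstanding send, so the first entry must be ours. */
    assert(ret == 1 && entry.op_context == context);
    return 0;
}
```

With an initially zeroed `struct mock_cq`, `blocking_send(&cq, &ctx, 3)` returns 0 after four reads. The call blocks only until the completion level the send asked for is reported; by itself it says nothing about where the data is.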

We need to be very clear about what semantics 'blocking send' needs to convey. Sockets supports blocking send calls, but a blocking send says nothing about where the message is when the call returns. If the desired semantic is that the buffer may be immediately re-used/freed/modified, then the FI_INJECT flag is the correct mapping.
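To make the FI_INJECT mapping concrete, here is a hedged sketch of what an inject-style path can look like inside a provider that has no hardware inject support and therefore copies. `sketch_inject()`, `struct sketch_tx_slot` and `SKETCH_INJECT_SIZE` are hypothetical. A real provider advertises its limit through the inject_size transmit attribute, and how (or whether) it copies is an implementation detail the API does not dictate.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical inject limit; a real provider reports its own value via the
 * inject_size transmit attribute. */
#define SKETCH_INJECT_SIZE 64

/* Hypothetical provider-owned staging area for one injected message. */
struct sketch_tx_slot {
    char data[SKETCH_INJECT_SIZE];
    size_t len;
    int in_use;
};

/* Inject-style send: once this returns 0 the provider holds a private copy,
 * so the caller may immediately reuse, modify, or free 'buf'. */
static int sketch_inject(struct sketch_tx_slot *slot, const void *buf,
                         size_t len)
{
    if (len > SKETCH_INJECT_SIZE)
        return -EMSGSIZE;   /* too big: caller must use a non-inject send */
    if (slot->in_use)
        return -EAGAIN;     /* staging area busy: retry after progress */
    memcpy(slot->data, buf, len);
    slot->len = len;
    slot->in_use = 1;
    return 0;
}
```

The caller may overwrite `buf` as soon as `sketch_inject()` returns 0, because the bytes that go on the wire come from `slot->data`. A message larger than the limit is rejected, which is the case where the application would have to fall back to a regular send or do the copy itself.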

The API does not dictate an implementation, and IMO apps should focus on performance requirements, not implementation details that they may believe lead to lower performance. 

> >	2. You get to send again as soon as the buffer is safe to write
> to (currently my use case for the cq),
> 
> Same as waiting for FI_INJECT_COMPLETE event.
> 
> >	3. You also get some kind of event when we're sure the
> destination received the message
> 
> This is just FI_DELIVERY_COMPLETE|FI_TRANSMIT_COMPLETE
> 
> The real question is if two events can be generated for one operation.
> The documentation for the FI_*_COMPLETE flags in the fi_endpoint man
> page all start like this:
> 
> 	"Indicates that a completion should be generated when ..."
> 
> "... a completion ..." sounds like there could be more than one
> completion event. It should say "... the completion ..." if only one
> event can be generated for each operation.

We want a single completion per operation.  Mere humans need to write to this API.
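The single-completion model can be sketched as follows: the application picks one completion level when it posts the operation, and the provider writes one CQ entry when the operation reaches that level, even though the operation internally passes through the earlier stages. The enum names only mirror the FI_*_COMPLETE flags; they are local, hypothetical values, not libfabric constants, and the stage comments are informal summaries.

```c
#include <assert.h>

/* Hypothetical ordering of the completion points discussed in this thread. */
enum sketch_stage {
    STAGE_POSTED = 0,
    STAGE_INJECT_COMPLETE,    /* source buffer may be reused */
    STAGE_TRANSMIT_COMPLETE,  /* transfer done; peer acked (reliable ep) */
    STAGE_DELIVERY_COMPLETE   /* data processed at the target */
};

struct sketch_op {
    enum sketch_stage requested;  /* one completion level per operation */
    enum sketch_stage reached;    /* how far the operation has progressed */
    int completions;              /* CQ entries written for this operation */
};

/* Advance the operation by one stage. A completion is counted only when
 * the requested stage is reached, so each operation yields exactly one. */
static void sketch_advance(struct sketch_op *op)
{
    if (op->reached == STAGE_DELIVERY_COMPLETE)
        return;
    op->reached = (enum sketch_stage)(op->reached + 1);
    if (op->reached == op->requested)
        op->completions++;
}
```

Driving an operation that requested STAGE_DELIVERY_COMPLETE through every stage still produces one completion. The two-stage scheme asked for earlier in the thread would instead count an entry at both the inject and the delivery stage; this sketch deliberately does not.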

> Earlier, Sean said "The timing difference between the two completion
> models seems minimal, especially compared to the time needed to
> generate, transfer, and process an additional ack." But, this is
> comparing FI_TRANSMIT_COMPLETE and FI_DELIVERY_COMPLETE.  The timing
> difference is quite noticeable between FI_INJECT_COMPLETE (local
> operation) and FI_TRANSMIT_COMPLETE|FI_DELIVERY_COMPLETE (remote
> operation). As Jeff Hammond pointed out, several different middleware
> could use - needs - this distinction for best performance.

I was comparing transmit versus delivery complete because those were the completion modes called out in the previous emails.

But if you want to go further and compare, say, inject versus delivery complete, I'm still not convinced that multiple completions are beneficial.

When you consider reliability, the source data needs to remain untouched until it has been acked by the remote side.  This would seem to imply that either the source data is copied or that "local completion" == "remote completion".  In the former case, source data may be copied to another memory buffer, the transmit queue, or NIC memory.  FI_INJECT covers this case.
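A small model of that argument, assuming a sender that may have to retransmit: `struct sketch_reliable_tx` and its functions are hypothetical, not provider code. If the source is copied at post time, the buffer is reusable immediately (the FI_INJECT case). If not, retransmissions read the application's buffer, so it only becomes reusable once the peer has acked, i.e. local completion coincides with remote completion.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_MSG 32

/* Hypothetical reliable-sender state for one message. If 'copied' is set,
 * retransmissions read the private copy; otherwise they read the
 * application's buffer, which therefore must stay untouched. */
struct sketch_reliable_tx {
    const char *user_buf;
    char copy[SKETCH_MAX_MSG];
    size_t len;
    int copied;
    int acked;
};

/* Post a message, optionally copying the source (inject-like). */
static void sketch_post(struct sketch_reliable_tx *tx, const char *buf,
                        size_t len, int copy_source)
{
    assert(len <= SKETCH_MAX_MSG);
    tx->user_buf = buf;
    tx->len = len;
    tx->copied = copy_source;
    tx->acked = 0;
    if (copy_source)
        memcpy(tx->copy, buf, len);
}

/* The application may reuse its buffer at post time if the data was
 * copied, otherwise only after the peer's ack. */
static int sketch_buffer_reusable(const struct sketch_reliable_tx *tx)
{
    return tx->copied || tx->acked;
}

/* Simulated retransmission after a loss: copies the bytes that would be
 * put on the wire into 'wire'. */
static void sketch_retransmit(const struct sketch_reliable_tx *tx, char *wire)
{
    memcpy(wire, tx->copied ? tx->copy : tx->user_buf, tx->len);
}
```

Modifying the buffer after a non-copying post and then retransmitting sends the modified bytes, which is why that buffer cannot be handed back to the application before the ack.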

- Sean