[ofiwg] completion flags as actually defined by OFI

Wed Apr 15 09:36:13 PDT 2015

> 
> On Wed, Apr 15, 2015 at 12:35:32AM +0000, Weiny, Ira wrote:
> 
> > > To me, the only way the above makes sense, is with dedicated
> > > hardware support. Which doesn't exist today.
> >
> > Even then I think the provider can return the FI_COMMIT_COMPLETE which
> > indicates an application level of completion ...  Only the application
> > knows that the memory is NVM and that storing in NVM is what is
> > important.  Libfabric can't make that determination on its own.
> 
> I could see a future where someone wants to optimize this; it would be incredibly
> useful for storage and database scenarios. The goal would be to have a
> completion at the sender know that the data is in peer persistent NVM without
> involving the peer CPU.
> 
> Obviously this requires that the HCA (never mind libfabric) and host bridge
> know how to make and handle the right cache bypass and fence PCI-E TLPs to
> make this possible.

I agree that an optimization like this would be nice.  However, I am arguing that a completion named "*_PERSISTENT_*" is wrong because it introduces an application concept into the libfabric layer.  Only applications that are aware of the persistence of this data would be able to recover after a crash of the application or OS, so it is up to the application to configure the provider/hardware to actually send that type of completion.  So while the completion event may not be sent by the application in the fast path, the application still needs to be involved.

The best alternative use case I could think of is hardware collective offload.

In this case the inbound data is not persistent but is used by the collective hardware without application involvement.  The completion means that the data has been consumed, not that it is persistent.  (I could even see the data being consumed at a switch!)  The hardware can still signal an FI_COMMIT_COMPLETE, but the meaning is application specific.

I know this is thin and, without real hardware, all speculative, but I don't want to see a libfabric "API explosion" where every little hardware feature is a new enum or API.

Perhaps all we can do at this point is recognize that new completions may be necessary and allocate spare bits for them?
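
To make the spare-bits idea concrete, here is a rough sketch in C.  Every name and bit position below (the EX_* macros and ex_completion_flags_valid()) is made up for illustration and is not taken from the libfabric headers; the only point is that a block of bits is set aside now and rejected until it is given a meaning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical completion-level flags. The names and bit positions are
 * illustrative assumptions, not libfabric's actual definitions. */
#define EX_TRANSMIT_COMPLETE  (1ULL << 0)
#define EX_DELIVERY_COMPLETE  (1ULL << 1)
#define EX_COMMIT_COMPLETE    (1ULL << 2)

/* Completion levels that have a defined meaning today. */
#define EX_COMPLETION_KNOWN \
    (EX_TRANSMIT_COMPLETE | EX_DELIVERY_COMPLETE | EX_COMMIT_COMPLETE)

/* Bits 4..7 are set aside for completion levels that do not exist yet. */
#define EX_COMPLETION_RESERVED (0xF0ULL)

/* Compile-time check that no defined level overlaps the reserved range. */
static_assert((EX_COMPLETION_KNOWN & EX_COMPLETION_RESERVED) == 0,
              "defined completion levels must not use reserved bits");

/* Reject any request that sets a reserved bit, so those bits can be
 * assigned later without changing behavior for existing callers. */
static bool ex_completion_flags_valid(uint64_t flags)
{
    return (flags & EX_COMPLETION_RESERVED) == 0;
}
```

With something like this, a new hardware completion type would take one of the reserved bits rather than a new enum or call, and its meaning could still be left to the application.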

> 
> Without that future, everything involves the CPU, so it may as well use an
> application ACK API which allows anything to happen.

I agree; it is hard to predict what the future will hold.

Ira