[ofiwg] completion flags as actually defined by OFI

Jason Gunthorpe jgunthorpe at obsidianresearch.com
Tue Apr 14 16:29:14 PDT 2015

On Tue, Apr 14, 2015 at 10:48:54PM +0000, Weiny, Ira wrote:

> I don't think you should say "consumed by the application".  Rather
> we should use your language above that "no fabric or peer local
> failures will prevent delivery."

So, as I've defined it, FI_DELIVERY guarantees the data will arrive
in the app buffer and that a completion will be delivered to the app. That
may not have happened yet, but it will happen - unless a peer local failure occurs.

FI_COMMIT_COMPLETE guarantees that the app has seen the completion and
done whatever other work it needs to do (ie fsync, flush to NVM, or
similar).

So, there are lots of in-between micro points between those two. One
of them is that the provider has copied the recv CQE into app memory
and is about to return. I think this is what the old language was
talking about.

What is the difference between the above paragraph and FI_DELIVERY?
Very little; it basically adds the guarantee that the data is visible
to the peer CPU (not in memory, just visible). But who cares? If a
peer local failure occurs then this memory is likely lost, and the
additional constraint has done nothing for us. Remember, peer local
failure is the only thing that breaks the FI_DELIVERY guarantee, and
peer local failure is a pretty big event.

Now, there are a few cases where the memory might not be
lost: non-volatile persistent memory, for instance, or a shared memory
segment (where peer local failure is restricted to mean the app crashed).

In this case, yes, it is valuable, and the semantic is obvious:

the additional guarantee that if the peer stored the message into
non-volatile memory, then that memory will retain the whole message,
uncorrupted, across a peer local failure. NVM will survive a peer
local failure including OS failure and loss of power; shared memory
(anon or file backed) will survive a peer local failure including
process kill.

To me, the only way the above makes sense is with dedicated hardware
support, which doesn't exist today. Until it does, the CPU is involved
and you are better off using FI_COMMIT_COMPLETE and having the app signal
commit once it has done whatever sync is needed for the memory type it
is working with. That is clearly more flexible and gets libfabric out
of the messy business of WTF 'persistent' means for memory.

So, that's my rationale for picking these points and not others.

> FWIW, I'm not sure if this is a good idea or not.  IMO it muddies
> the water between libfabric and the application.

There may be cases where libfabric can piggyback the application ack
on its own low-level messaging and gain efficiency.

For instance, I have an app that uses RDMA to implement stream
semantics, and in that scenario the low-level protocol ack that frees
up space in the stream buffer also signals to the remote peer that
fsync() has completed, with no explicit message required.

