[ofiwg] Proposal for enhancement to support additional Persistent Memory use cases (ofiwg/libfabric#5874)

Fri May 1 11:06:01 PDT 2020

> It was discussed that with some applications, all or most data would be required to be
> persistent. The solution at the time was to provide FI_COMMIT_COMPLETE as part of the
> default TX op_flags, which would incur a higher cost to provide that level of
> completion. The goal with this proposal would be to allow upper layers to set a less
> strict completion model, such as delivery or transmit complete, as part of the default
> op_flags or as a per-operation flag, and to address persistence as a batch operation
> via the fi_commit API.

OFI supports:

- different completion levels on a per transfer basis
- relaxed completion ordering
- relaxed message ordering

These are significant feature differences from common lower-level transports and worth maintaining.  The difficulty is that at some point an application needs to ensure that all prior transfers have reached the same completion level, without losing these features.

This is in large part what is being discussed here.  The persistent memory flush proposal is just one mechanism with a narrowly defined scope.

> The proposed definition is limited to RMA and AMO because we didn't have a strong use
> case for messaging, but I'd like to go the route that allows messaging to be easily
> included if that changes later down the road.

I would like to include messaging as well, but it is harder to define.  There's no easy way to reference the messages to 'commit', like you can for an RMA region.

Basically, I think we have an API solution for 'fence-based completion levels'.  iWARP NICs can do FI_FENCE | FI_INJECT_COMPLETE.  IB NICs can do FI_FENCE | FI_TRANSMIT_COMPLETE.  Some special HPC-focused NICs can do FI_FENCE | FI_DELIVERY_COMPLETE.  No NICs work with FI_FENCE | FI_COMMIT_COMPLETE, and any future architecture path to get there is hazy.

I'd say that we take messaging off the table as an API goal, but it's still an important discussion point, so that we understand the scope of the issues.

> Agreed on the CQ aspect. As a note, EQs are not being discussed for the initiator, only
> the target, so I'll put my EQ comments in the next comment. As a general comment, I
> think that this could be a good candidate for discussion at the next OFIWG because it
> is a strange grey area to me.

CQ-related operations are considered fast path and may be used to drive progress.  The EQ does not have these requirements.  Commit/flush is a data *operation* request, with the expectation that it is backed by a wire protocol that is in-band with other data operations.  The reason for adding such a call is purely performance.

> I agree that this needs to be defined in terms of application-level behavior. However,
> I do think we need to talk about if and how applications should be expected to
> facilitate the desired functionality if the hardware is not capable of it.  The details
> of how a provider like sockets implements the functionality aren't important to
> define here, but if the provider needs the application to interact/configure in a
> specific way then I think that should be covered here. If there isn’t hardware support
> for FI_COMMIT_COMPLETE, then it seems to become a much more difficult problem.
> Libfabric could provide events to the application through EQ or CQ events, or go a
> similar route as HMEM is going now. I'd prefer to provide events to the application
> rather than attempt to support every PMEM library/hardware when handling the software
> emulation case.

I'm going to continue to disagree on the hardware / software aspects.  Either a provider supports a feature or it doesn't.  It doesn't matter if the hardware support is in the NIC, FPGA, switches, CPU, chipset, or is some combination depending on the operations or scale involved.  Surprisingly, a lot of people at Intel consider the CPU to be hardware.  :)  There are many ways of implementing these features.

The provider not supporting a feature, or specifying some mode bit, is the indication that the application must do something different.

> > The FI_RMA_PMEM capability should be sufficient to indicate support for RMA reads
> > and writes to persistent memory.  That should be an inclusive flag (along with the API
> > version) indicating that all related operations are supported.
> 
> Something like this?
> 
> #define FI_PMEM  (FI_RMA_PMEM | FI_AMO_PMEM | FI_MSG_PMEM)

This isn't quite what I had in mind, but not far off either.  I meant that if the provider reports that it supports FI_RMA_PMEM, then it supports registration of persistent memory, FI_COMMIT_COMPLETE, whatever fi_commit() operation is dreamed up, etc.  That is, we don't break the pieces up; an entire usable solution needs to be there.

If we want to support atomic operations to persistent memory, we add an FI_ATOMIC_PMEM capability.  If that's set, the same conditions apply (MR support, FI_COMMIT_COMPLETE, etc., but for atomics).
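
As a rough illustration of the "one bit means the whole solution" idea, here is a minimal sketch of the check an application would make.  The X_* bit values and has_caps() are made up for this example; the real FI_RMA_PMEM value comes from <rdma/fabric.h>, and FI_ATOMIC_PMEM is only a proposal in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in capability bits for illustration only.  These are NOT
 * libfabric's values; X_ATOMIC_PMEM models a capability that is
 * merely proposed in this discussion. */
#define X_RMA_PMEM    (1ULL << 0)
#define X_ATOMIC_PMEM (1ULL << 1)

/* True only if every requested capability bit is reported.  Each bit
 * is all-or-nothing: if it is set, MR registration, commit-level
 * completion, etc. are all expected to work for that operation class. */
bool has_caps(uint64_t provider_caps, uint64_t wanted)
{
    return (provider_caps & wanted) == wanted;
}
```

Note that with a composite mask such as the FI_PMEM define quoted above, a plain `caps & mask` test would pass when only one of the bits is set, so the comparison against the full mask matters.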

> I'm not yet convinced on FI_COMMIT_COMPLETE|FI_FENCE. If libfabric suggested the use of
> that, does that imply that providers must support 0-length sends and/or control
> messaging on behalf of the application? Does the data transfer itself provide any
> context to the region being flushed? What happens in the case of multiple persistent
> memory domains or devices? How would that data transfer provide the context necessary
> to flush a specific region, memory domain, or device? This seems more complicated than
> the initial suggestion indicates.

It's possible that a provider may not report support for FI_FENCE when FI_RMA_PMEM is requested.  But I'm suggesting FI_FENCE as the more generic solution to 'upgrading' the completion level of all prior operations, and it is not restricted to FI_COMMIT_COMPLETE or persistent memory.  We can specify a memory region with RMA or atomics.  We can't with messages.
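
To make the 'upgrade' idea concrete, here is a toy model of one reading of it: a fenced operation posted at some completion level raises the guarantee for every operation posted before it.  This is a sketch of the intended semantics only, not provider code; the enum, struct, and function names are invented for the example and are not part of libfabric.

```c
#include <stddef.h>

/* Toy completion levels, ordered weakest to strongest.  They stand in
 * for FI_INJECT_COMPLETE .. FI_COMMIT_COMPLETE but are not libfabric's
 * flag values. */
enum level { LVL_INJECT, LVL_TRANSMIT, LVL_DELIVERY, LVL_COMMIT };

struct op {
    enum level requested; /* level requested when the op was posted */
    int fenced;           /* nonzero if the op was posted with a fence */
};

/* Level guaranteed for ops[idx] once every op in ops[0..n) has
 * completed: its own requested level, or the level of any later
 * fenced op if that is stronger.  Assumes idx < n. */
enum level guaranteed_level(const struct op *ops, size_t n, size_t idx)
{
    enum level best = ops[idx].requested;

    for (size_t i = idx + 1; i < n; i++) {
        if (ops[i].fenced && ops[i].requested > best)
            best = ops[i].requested;
    }
    return best;
}
```

In this model, two writes posted at inject and transmit complete, followed by a fenced operation at delivery complete, all end up guaranteed at delivery complete.  Without the fence, each write keeps only the level it asked for.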

This does not preclude adding an fi_commit() operation specifically targeting RMA/atomic memory regions.

> It's a good argument, and generally I feel the same way. What do you suggest as an
> alternative? Callbacks were suggested as a way for the provider to do some behavior on
> behalf of the application upon the receipt of the associated event. This would have
> allowed the provider to issue the commit/flush to device and then return the ACK back
> to the initiator that the commit had succeeded/data was flushed as requested. Without a
> callback, I do not see a clean way for libfabric to coordinate flush and
> acknowledgement back to the initiator.

Commit/flush should just write a completion into the CQ, similar to other operations.
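
A toy sketch of what that looks like from the initiator's side: the commit completion lands in the same queue as data completions and is matched by the context the caller supplied, with no callback involved.  The toy_cq type and its functions are invented for this example; a real application would read completions with fi_cq_read() instead.

```c
#include <stddef.h>

#define TOY_CQ_DEPTH 8

enum op_kind { OP_WRITE, OP_COMMIT };

struct cq_entry {
    void *context;     /* caller-supplied context from the posted op */
    enum op_kind kind; /* which kind of operation completed */
};

/* Fixed-size ring standing in for a completion queue.  head and tail
 * count entries read and written; zero-initialize before use. */
struct toy_cq {
    struct cq_entry ring[TOY_CQ_DEPTH];
    size_t head, tail;
};

/* Provider side: record a completion.  Returns 0, or -1 if full. */
int toy_cq_push(struct toy_cq *cq, void *context, enum op_kind kind)
{
    if (cq->tail - cq->head == TOY_CQ_DEPTH)
        return -1;
    cq->ring[cq->tail % TOY_CQ_DEPTH].context = context;
    cq->ring[cq->tail % TOY_CQ_DEPTH].kind = kind;
    cq->tail++;
    return 0;
}

/* Application side: fetch the oldest completion.  Returns 1 and fills
 * *out if one was available, 0 if the queue is empty. */
int toy_cq_read(struct toy_cq *cq, struct cq_entry *out)
{
    if (cq->head == cq->tail)
        return 0;
    *out = cq->ring[cq->head % TOY_CQ_DEPTH];
    cq->head++;
    return 1;
}
```

The point of the sketch is only that a commit entry is read the same way as a write entry, so the existing fast-path polling loop covers both.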

> > To be clear, the proposal only supports RMA writes, and maybe atomics, to the
> > target memory.  That is likely sufficient for now, but I'd like to ensure that we have
> > a way to extend pmem support beyond the limited use cases being discussed.
> 
> RMA, and atomics -- with the intent not to exclude messaging. This is why the naming
> change from FI_RMA_PMEM to FI_PMEM was suggested.

I would exclude messages unless we can define a complete solution for how it works, and have at least one provider implement it.

- Sean