[ofiwg] Proposal for enhancement to support additional Persistent Memory use cases (ofiwg/libfabric#5874)

Hefty, Sean sean.hefty at intel.com
Mon Apr 27 19:38:24 PDT 2020


Top-posting main discussion point.  Other comments further down:

Conceptually, what's being proposed is specifying a data transfer as a 2-step process.

1. identify the data source and target
2. specify the completion semantic

Theoretically, the actual data transfer can occur any time after step 1 and before step 2 completes.  As an additional optimization, step 2 can apply to multiple step 1s.

We need to decide:

A. What completion semantic applies to step 1?
B. What operations do we support for step 1?
C. What completion semantics are supported for step 2?

The current answers are:

A. All completion levels are supported.  It's possible that none of them are desirable here, and we need to introduce a new mode: FI_UNDEFINED_COMPLETE.  This would indicate that the buffer cannot be re-used, and the data is not visible at the target, until a step 2 operation covering the same target memory range completes.

B. RMA reads and writes are supported.  It shouldn't be difficult to support atomics through the same APIs as well.  Message transfers are more difficult to specify in step 2, making them harder to support.
 
C. The proposal only supports FI_COMMIT_COMPLETE.  Other levels could be added, though that may only make sense if we define something like FI_UNDEFINED_COMPLETE.

I'm throwing FI_UNDEFINED_COMPLETE out for discussion.  There would be issues trying to define it, since data transfers issued at step 1 could generate completions locally and remotely prior to step 2 being invoked.  Those completions just wouldn't mean anything until step 2 completes.  The provider would select the best completion option for step 1.
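
For concreteness, here's a rough sketch of the 2-step model from the initiator's side.  fi_commit below is only the call proposed in this thread (prototype copied from the definitions further down), not an existing libfabric API, and the remote address/key exchange is assumed to have happened out of band.

#include <rdma/fabric.h>
#include <rdma/fi_rma.h>

/* Proposed prototype, copied from the definitions below; not in libfabric today. */
ssize_t fi_commit(struct fid_ep *ep, const struct fi_rma_iov *iov,
                  size_t count, fi_addr_t dest_addr, uint64_t flags,
                  void *context);

static int two_step_write(struct fid_ep *ep, fi_addr_t peer,
                          const void *buf, size_t len,
                          uint64_t raddr, uint64_t rkey)
{
        struct fi_rma_iov commit_iov = { .addr = raddr, .len = len, .key = rkey };
        ssize_t ret;

        /* Step 1: post the write.  Whatever completion the provider reports
         * here would not imply persistence at the target.  NULL desc assumes
         * FI_MR_LOCAL is not required. */
        ret = fi_write(ep, buf, len, NULL, peer, raddr, rkey, NULL);
        if (ret)
                return (int) ret;

        /* Step 2: commit the covered range.  Only this operation's completion
         * (FI_COMMIT_COMPLETE) guarantees the data is in the persistence domain. */
        return (int) fi_commit(ep, &commit_iov, 1, peer, FI_COMMIT_COMPLETE, NULL);
}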


> Libfabric requires modifications to support RMA and atomic operations targeted at
> remote memory registrations backed by persistent memory devices. These modifications
> should be made with the intent to drive support for persistent memory usage by
> applications that rely on communications middleware such as SHMEM in a manner that is
> consistent with byte-based/stream-based addressable memory formats. Existing proposals
> (initial proposal) support NVMe/PMoF approaches, whereas this approach should support
> flat memory and non-block-addressed memory structures and devices.
> 
> Changes may be required in as many as three areas:
> 
> *	Memory registration calls
> 
> 	*	This allows a memory region to be registered as being capable of
> persistence. This has already been introduced into the upstream libfabric GITHUB, but
> should be reviewed to ensure it matches use case requirements.

FI_RMA_PMEM is defined as an MR flag.  Note that this definition intentionally prevents non-RMA transfers from taking advantage of persistent memory semantics.

The intent of this flag is to give providers implementation flexibility, specifically based on hardware/software differences.
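
For reference, a minimal sketch of a target-side registration, assuming the provider accepts FI_RMA_PMEM in the fi_mr_reg flags and that pmem_buf was mapped from a persistent memory device (e.g. via libpmem):

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

static int register_pmem_region(struct fid_domain *domain, void *pmem_buf,
                                size_t len, uint64_t requested_key,
                                struct fid_mr **mr)
{
        /* Expose the region for remote RMA access and mark it as backed by
         * persistent memory so the provider can apply pmem semantics. */
        return fi_mr_reg(domain, pmem_buf, len,
                         FI_REMOTE_WRITE | FI_REMOTE_READ,
                         0 /* offset */, requested_key,
                         FI_RMA_PMEM, mr, NULL);
}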


> *	Completion semantics
> 
> 	*	These changes allow a completion event or notification to be deferred until
> the referenced data has reached the persistence domain at the target. This has already
> been introduced into the upstream libfabric GITHUB, but should be reviewed to ensure it
> matches use case requirements.

Completion semantics may be adjusted on a per-transfer basis.  The FI_COMMIT_COMPLETE semantic applies to both the initiator and target.  Completion semantics are a minimal guarantee from a provider.  The provider can do more.
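
As a sketch of the per-transfer selection, an application might issue most writes with FI_TRANSMIT_COMPLETE and reserve FI_COMMIT_COMPLETE for the transfer whose completion must also guarantee persistence (remote address, key, and peer are assumed to have been exchanged already):

#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_rma.h>

static ssize_t write_with_level(struct fid_ep *ep, fi_addr_t peer,
                                const void *buf, size_t len,
                                uint64_t raddr, uint64_t rkey,
                                uint64_t completion_level, void *ctx)
{
        struct iovec iov = { .iov_base = (void *) buf, .iov_len = len };
        struct fi_rma_iov rma_iov = { .addr = raddr, .len = len, .key = rkey };
        struct fi_msg_rma msg = {
                .msg_iov = &iov, .iov_count = 1, .addr = peer,
                .rma_iov = &rma_iov, .rma_iov_count = 1, .context = ctx,
        };

        /* The flags argument overrides the endpoint's default op_flags for
         * this operation only, e.g. FI_TRANSMIT_COMPLETE or FI_COMMIT_COMPLETE. */
        return fi_writemsg(ep, &msg, completion_level);
}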


> *	Consumer control of persistence
> 
> 	*	As presently implemented in the upstream libfabric GITHUB, persistence is
> determined on a transaction-by-transaction basis. It was acknowledged at the time that
> this is a simplistic implementation. We need to reach consensus on the following:
> 
> 		*	Should persistence be signaled on the basis of the target memory
> region? For example, one can imagine a scheme where data targeted at a particular
> memory region is automatically pushed into the persistence domain by the target,
> obviating the need for any sort of commit operation.

In cases where a commit operation is not needed, it can become a no-op, but it may be required functionality for some providers.


> 		*	Is an explicit 'commit' operation of some type required, and if so,
> what is the scope of that commit operation? Is there a persistence fence defined such
> that every operation prior to the fence is made persistent by a commit operation?

With the current API, persistence can be achieved by issuing a 0-length RMA with FI_COMMIT_COMPLETE | FI_FENCE semantics.  The fence requires that *all* prior transfers over that endpoint meet the requested completion semantic.

This may not be ideal, but may be the best way to handle message transfers to persistent memory.
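
A sketch of that flush, assuming the remote key refers to an FI_RMA_PMEM registration (depending on the provider, a single zero-length iov entry may be needed instead of an empty msg_iov):

#include <rdma/fabric.h>
#include <rdma/fi_rma.h>

static ssize_t flush_to_pmem(struct fid_ep *ep, fi_addr_t peer,
                             uint64_t raddr, uint64_t rkey, void *ctx)
{
        struct fi_rma_iov rma_iov = { .addr = raddr, .len = 0, .key = rkey };
        struct fi_msg_rma msg = {
                .msg_iov = NULL, .iov_count = 0, .addr = peer,
                .rma_iov = &rma_iov, .rma_iov_count = 1, .context = ctx,
        };

        /* FI_FENCE: all prior transfers on this endpoint must meet the
         * requested semantic before this operation can complete.
         * FI_COMMIT_COMPLETE: that semantic is persistence at the target. */
        return fi_writemsg(ep, &msg, FI_FENCE | FI_COMMIT_COMPLETE);
}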


> Proposal
> 
> The experimental work in the OFIWG/libfabric branch is sufficient for the needs of
> SHMEM, with exception to the granularity of event generation. When the current
> implementation generates events, it would generate commit-level completion events with
> every operation. That type of operation would make the delivery of completion events
> take longer than necessary for most operations, so SHMEM would need finer control over
> commit flushing behavior.

OFI does not require that an event be generated for every transfer.  It also allows transfers to report completions using 'lower' completion semantics, such as FI_TRANSMIT_COMPLETE.  Completion events at the target of an RMA write require the FI_RMA_EVENT capability and are independent of PMEM.
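
For example, an initiator can bind its transmit CQ with FI_SELECTIVE_COMPLETION so that only operations explicitly flagged with FI_COMPLETION generate events; everything else completes silently:

#include <rdma/fabric.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_endpoint.h>

static int bind_tx_cq_selective(struct fid_ep *ep, struct fid_cq *tx_cq)
{
        /* Only operations passed FI_COMPLETION will write entries to tx_cq;
         * e.g. fi_writemsg(ep, &msg, FI_COMPLETION | FI_COMMIT_COMPLETE). */
        return fi_ep_bind(ep, &tx_cq->fid, FI_TRANSMIT | FI_SELECTIVE_COMPLETION);
}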


> To satisfy this, the following is being proposed:
> 
> *	A new API: fi_commit (See definitions: fi_commit)
> 	The new API would be used to generate a commit instruction to a target peer. The
> instruction would be defined by a set of memory registration keys, or regions by which
> the target could issue a commit to persistent memory.

See discussion at the top.


> 	*	A single request to fi_commit should generate a control message to target
> hardware or software emulation environment to flush the contents of memory targets.

This needs to be defined in terms of application level semantics, not implementation details.  fi_commit could be a no-op based on the provider implementation.  (It actually would be for the socket and tcp providers, which act at the target based on the MR flag.)


> Memory targets are defined by the iov structures, and key fields – and the number of
> memory targets are defined by the count field. The destination address is handled by
> the dest_addr field. The flags field is held reserved at this time to allow for
> flexibility in the API design to future proof against options we might not conceive of
> until after the prototype is complete, and the context available for the user and
> returned with the completion

The proposed definition is limited to RMA (and atomic) writes.  There is no mechanism for handling RMA reads into persistent memory, for example.  That should be included.  Message transfers may need a separate mechanism for this.  That can be deferred (left undefined by the man pages), but ideally we should have an idea of how to support it.

The best existing API definition for an fi_commit call would be the fi_readmsg/fi_writemsg() calls.  We could even re-use those calls by adding a flag.
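
A hypothetical sketch of the flag-based variant.  FI_COMMIT below is the name floated in the proposal, not an existing libfabric flag, and the value is a placeholder chosen only for illustration:

#include <rdma/fabric.h>
#include <rdma/fi_rma.h>

/* Placeholder for a proposed flag; value chosen arbitrarily for illustration. */
#define FI_COMMIT (1ULL << 62)

static ssize_t commit_ranges(struct fid_ep *ep, fi_addr_t peer,
                             const struct fi_rma_iov *ranges, size_t count,
                             void *ctx)
{
        /* No local iov: this would be a commit request covering the listed
         * remote ranges, not a data transfer. */
        struct fi_msg_rma msg = {
                .addr = peer, .rma_iov = ranges, .rma_iov_count = count,
                .context = ctx,
        };

        return fi_writemsg(ep, &msg, FI_COMMIT | FI_COMMIT_COMPLETE);
}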


> 	*	Since this API behaves like a data transfer API, it is expected that this
> API would generate a completion event to the local completion queue associated with the
> EP from which the transaction was initiated against.

The generation of a *CQ* event makes sense.  We need to define if and how counters, both local and remote, are updated.  EQ events are not the right API match.
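
As a sketch, the initiator would reap the commit's completion from its regular transmit CQ like any other RMA operation (simplified loop; completions for other operations are simply discarded here):

#include <rdma/fabric.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

static int wait_for_commit(struct fid_cq *tx_cq, void *commit_ctx)
{
        struct fi_cq_entry entry;
        ssize_t ret;

        for (;;) {
                ret = fi_cq_read(tx_cq, &entry, 1);
                if (ret == 1 && entry.op_context == commit_ctx)
                        return 0;       /* covered ranges are now persistent */
                if (ret < 0 && ret != -FI_EAGAIN)
                        return (int) ret;       /* real code would read the error entry */
        }
}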


> 	*	At the target, this should generate an event to the target's event queue –
> if and only if the provider supports software emulated events. If a provider is capable
> of hardware level commits to persistent memory, the transaction should be consumed
> transparently by the hardware, and does not need to generate an event at the target.
> This will require an additional event definition in libfabric (See definition for
> fi_eq_commit_entry)

This too needs to be defined based on the application level semantics, not implementation.  The app should not be aware of implementation differences, except where mode bits dictate for performance reasons.  (And I can say that developers hate dealing with those differences, so we need to eliminate them.)

If we limit commit to RMA transfers, it makes sense for it to act as an RMA call for most purposes (i.e. fi_readmsg/fi_writemsg).  For example, it should be able to carry CQ data and generate remote events (FI_RMA_EVENT) on the target CQ and counters.  We also need to consider whether there's any impact on counters associated with the MR.
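
A target-side sketch of what that could look like: with FI_RMA_EVENT enabled and the initiator attaching remote CQ data to the commit, the target would see a data-format completion once the commit finishes (counter updates on the MR remain to be defined):

#include <rdma/fabric.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

static int poll_for_remote_commit(struct fid_cq *rx_cq, uint64_t *cq_data)
{
        struct fi_cq_data_entry entry;
        ssize_t ret;

        ret = fi_cq_read(rx_cq, &entry, 1);
        if (ret == -FI_EAGAIN)
                return 0;               /* nothing yet */
        if (ret < 0)
                return (int) ret;
        if (entry.flags & FI_REMOTE_CQ_DATA) {
                *cq_data = entry.data;  /* immediate data carried by the commit */
                return 1;
        }
        return 0;                       /* some other remote event; not the commit */
}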


> *	A new EQ event definition (fi_eq_commit_entry) to support software-emulated
> persistence for devices that cannot provide hardware support
> 
> 	*	The iov, and count variables mirror the original iov, and count contents of
> the originating request.
> 	*	The flags may be a diminished set of flags from the original transaction
> under the assumption that only some flags would have meaning at the target and sending
> originator-only flags to the target would have little value to the target process.

If any events are generated, they need to be CQ related, not EQ.


> *	Additional flags or capabilities
> 
> 	*	A provider should be able to indicate whether they support software
> emulated notifications of fi_commit, or whether they can handle hardware requests for
> commits to persistent memory

The implementation of hardware vs software should not be exposed.  Hybrid solutions (e.g. RxM or large transfers over verbs devices) are also possible.


> 		*	An additional flag should be introduced to the fi_info structure
> under modes: FI_COMMIT_MANUAL (or something else)

The FI_RMA_PMEM capability should be sufficient to indicate support for RMA reads and writes to persistent memory.  That should be an inclusive flag (along with the API version) indicating that all related operations are supported.
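
Discovery could then stay entirely within the existing fi_getinfo model; the FI_VERSION(1, 8) level here is an assumption about where the capability lands:

#include <stddef.h>
#include <rdma/fabric.h>

static struct fi_info *find_pmem_provider(void)
{
        struct fi_info *hints, *info = NULL;
        int ret;

        hints = fi_allocinfo();
        if (!hints)
                return NULL;

        /* Only providers that support RMA to persistent memory will match. */
        hints->caps = FI_RMA | FI_RMA_PMEM;
        hints->ep_attr->type = FI_EP_RDM;

        ret = fi_getinfo(FI_VERSION(1, 8), NULL, NULL, 0, hints, &info);
        fi_freeinfo(hints);
        return ret ? NULL : info;
}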


> 			*	This flag would indicate to the application that events may be
> generated to the event queue for consumption by the application. Commit events would be
> generated upon receipt of a commit message from a remote peer, and the application
> would be responsible for handling the event.
> 			*	Lack of the FI_COMMIT_MANUAL flag, and the presence of the
> FI_RMA_PMEM (or FI_PMEM) flag in the info structure should imply that the hardware is
> capable of handling the commit requests to persistent memory and the application does
> not need to read the event queue for commit events.
> 
> *	Change of flag definition
> 
> 	*	The FI_RMA_PMEM flag should be changed to FI_PMEM to indicate that the
> provider is PMEM aware, and supports RMA/AMO/MSG operations to and from persistent
> memory.
> 	*	There may be little value in supporting messaging interfaces, but it is
> something that could supported.

Support for messaging requires additional definitions.  Part of the discussion is figuring out the scope of what should be defined in the short term.  As mentioned above, FI_FENCE | FI_COMMIT_COMPLETE can be used to commit message transfers.  I can't think of a better alternative here.  However, I'm not sure if the proposed IBTA and IETF specifications will result in hardware capable of supporting the FI_FENCE | FI_COMMIT_COMPLETE semantic.  :/
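
A sketch of the message case, assuming the target's receive buffers live in an FI_RMA_PMEM-registered region and raddr/rkey refer to that region:

#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_rma.h>

static ssize_t send_then_commit(struct fid_ep *ep, fi_addr_t peer,
                                const void *buf, size_t len,
                                uint64_t raddr, uint64_t rkey, void *ctx)
{
        struct fi_rma_iov rma_iov = { .addr = raddr, .len = 0, .key = rkey };
        struct fi_msg_rma flush = {
                .addr = peer, .rma_iov = &rma_iov, .rma_iov_count = 1,
                .context = ctx,
        };
        ssize_t ret;

        /* The send itself carries no persistence guarantee. */
        ret = fi_send(ep, buf, len, NULL, peer, NULL);
        if (ret)
                return ret;

        /* The fenced zero-length write cannot complete until the prior send
         * meets FI_COMMIT_COMPLETE, i.e. its data is persistent at the target. */
        return fi_writemsg(ep, &flush, FI_FENCE | FI_COMMIT_COMPLETE);
}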


> *	Addition of an event handler registration for handling event queue entries within
> the provider context (See Definition: fi_eq_event_handler)
> 
> 	*	Essentially, this becomes a registered callback for the target application
> to handle specific event types. We can use this mechanism with the target application
> to allow the provider to handle events internally using a function provided by the
> application. The function would contain the logic necessary to handle the event

Callbacks are to be avoided.  They present difficult locking scenarios with severe restrictions on what the application can do from the callback, and they create challenging object destruction situations.  Those restrictions can be difficult for an application to enforce, since calls made from the callback into other libraries may violate them.


> 	*	Specific to PMEM, a function handler would be used by the target
> application to handle commits to persistent memory as they were delivered without
> requiring a fi_eq_read and some form of acknowledgement around the commit action. With
> the handler, the commit could be handled entirely by the function provided by the
> application, and the return code from the application provided call-back would be
> sufficient for a software emulation in the provider to produce the return message to
> the sender that the commit transaction is fully complete. The use of a handler allows
> us to make the commit transaction as light-weight, or heavy-weight as necessary.
> 
> Definitions:
> 
> fi_commit
> 
> ssize_t fi_commit(struct fid_ep *ep,
>                   const struct fi_rma_iov *iov,
>                   size_t count,
>                   fi_addr_t dest_addr,
>                   uint64_t flags,
>                   void *context);
> 
> fi_eq_commit_entry
> 
> struct fi_eq_commit_entry {
>     fid_t                    fid;    /* fid associated with request */
>     const struct fi_rma_iov *iov;    /* iovec of memory regions to be
>                                         committed to persistent memory */
>     size_t                   count;  /* number of iovec/key entries */
>     uint64_t                 flags;  /* operation-specific flags */
> };
> 
> fi_eq_event_handler
> 
> typedef ssize_t (*fi_eq_event_handler_t)(struct fid_eq *eq,
>                                          uint64_t event_type,
>                                          void *event_data,
>                                          uint64_t len,
>                                          void *context);
> 
> ssize_t fi_eq_register_handler(struct fid_eq *eq,
>                                uint64_t event_type,
>                                fi_eq_event_handler_t handler,
>                                void *context);
> 
> Use cases supported by this proposal:
> 
> *	As an application writer, I need to commit multiple previously-sent data
> transfers to the persistence domain

To be clear, the proposal only supports RMA writes, and maybe atomics, to the target memory.  That is likely sufficient for now, but I'd like to ensure that we have a way to extend pmem support beyond the limited use cases being discussed.


> 	*	Previous functionality allows for a commit for every message as is the case
> for FI_COMMIT_COMPLETE, or the use of FI_COMMIT on a per-transaction basis. The need in
> this use case is performance-oriented, to allow less strict delivery model to the NIC
> for most messages followed up with a 'flush' of the NIC to the persistence domain. This
> allows most messages targeted to the persistence domain to complete with a less strict
> delivery model, and provides a mechanism to ensure that those data transfers are
> eventually persisted.

Unless the app has set FI_COMMIT_COMPLETE as the default completion model, it only applies to the operation on which it was set.  The main gap I'm aware of in the proposed specifications is support for a 'flush' type semantic.
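
Setting that default is just a hints change before endpoint creation; whether a provider accepts FI_COMMIT_COMPLETE in op_flags is something to verify:

#include <rdma/fabric.h>

static void request_commit_by_default(struct fi_info *hints)
{
        /* Every transmit inherits this completion level unless a per-operation
         * flags argument overrides it. */
        hints->tx_attr->op_flags = FI_COMMIT_COMPLETE;
        hints->caps |= FI_RMA | FI_RMA_PMEM;
}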


- Sean

