[ofiwg] Proposal for enhancement to support additional Persistent Memory use cases (ofiwg/libfabric#5874)

Grun, Paul paul.grun at hpe.com
Fri May 1 11:13:54 PDT 2020


Keep in mind that the libfabric API doesn't necessarily directly mimic what is implemented in verbs. The requirement is that the verbs semantics are implementable via a libfabric implementation.  I think of libfabric as being a slightly more abstract interface than verbs, hence the libfabric APIs don't necessarily expose the gritty details described in the current IBTA Annex.

> -----Original Message-----
> From: ofiwg <ofiwg-bounces at lists.openfabrics.org> On Behalf Of Douglas,
> Chet R
> Sent: Friday, May 1, 2020 7:51 AM
> To: Rupert Dance - SFI <rsdance at soft-forge.com>; Swaro, James E
> <james.swaro at hpe.com>; Hefty, Sean <sean.hefty at intel.com>;
> ofiwg at lists.openfabrics.org
> Subject: Re: [ofiwg] Proposal for enhancement to support additional Persistent
> Memory use cases (ofiwg/libfabric#5874)
> 
> It matters!  Eventually (now?) we want full RDMA extension support in libfabric,
> libibverbs, and the verbs spec.  This appears to be based on Intel's original
> libfabric proposal?  Commit is not a valid term. Complete RDMA memory
> placement extension support looks different than it did in that original proposal.
> We need to architect the complete solution.  Don’t we?  Does it support RDMA
> Flush, Write Atomic and Verify?  How do you register cached vs uncached
> pmem?  Is this already in the wild?  If not we shouldn’t release it without further
> consideration.
> 
> -----Original Message-----
> From: ofiwg <ofiwg-bounces at lists.openfabrics.org> On Behalf Of Rupert Dance
> - SFI
> Sent: Friday, May 01, 2020 8:31 AM
> To: 'Swaro, James E' <james.swaro at hpe.com>; Hefty, Sean
> <sean.hefty at intel.com>; ofiwg at lists.openfabrics.org
> Subject: Re: [ofiwg] Proposal for enhancement to support additional Persistent
> Memory use cases (ofiwg/libfabric#5874)
> 
> Is this team aware of what the IBTA is doing with PME or does it not matter
> since it is libfabric?
> 
> -----Original Message-----
> From: ofiwg <ofiwg-bounces at lists.openfabrics.org> On Behalf Of Swaro, James
> E
> Sent: Friday, May 01, 2020 9:41 AM
> To: Hefty, Sean <sean.hefty at intel.com>; ofiwg at lists.openfabrics.org
> Subject: Re: [ofiwg] Proposal for enhancement to support additional Persistent
> Memory use cases (ofiwg/libfabric#5874)
> 
> >    > 	*	This allows a memory region to be registered as being capable
> of
> >    > persistence. This has already been introduced into the upstream libfabric
> GITHUB, but
> >    > should be reviewed to ensure it matches use case requirements.
> 
> >    FI_RMA_PMEM is defined as an MR flag.  Note that this definition
> intentionally limits non-RMA transfers from taking advantage of persistent
> memory semantics.
> 
> >    The intent of this flag is to give providers implementation flexibility,
> specifically based on hardware/software differences.
> 
> Understood. The intent of this section of the proposal was to outline potential
> areas for change. Any questions posed here were historical and meant to
> provoke discussion. They might even be a little dated. Those changes and the
> rationale are discussed below.
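>
> For reference, the registration path being discussed is just the existing MR flag.
> A minimal sketch (buffer, length, and domain names are placeholders):
>
> /* Register a persistent-memory-backed buffer with the existing
>  * FI_RMA_PMEM MR flag; pmem_buf, pmem_len, and domain are placeholders. */
> struct fid_mr *mr;
> int ret = fi_mr_reg(domain, pmem_buf, pmem_len,
>                     FI_REMOTE_WRITE | FI_REMOTE_READ,
>                     0 /* offset */, 0 /* requested key */,
>                     FI_RMA_PMEM, &mr, NULL);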
> 
> 
> >    > every operation. That type of operation would make the delivery of
> completion events
> >    > take longer than necessary for most operations, so SHMEM would need
> finer control over
> >    > commit flushing behavior.
> 
> >    OFI does not require that an event be generated for every transfer.  It also
> allows transfers to report completions using 'lower' completion semantics, such
> as FI_TRANSMIT_COMPLETE.  Completion events at the target of an RMA write
> require the FI_RMA_EVENT capability, and are independent of PMEM.
> 
> Understood. This paragraph was intended to address a complication that was
> raised in one of the meetings.
> 
> It was discussed that with some applications, all or most data would be required
> to be persistent. The solution at the time was to provide
> FI_COMMIT_COMPLETE as part of the default TX op_flags, which
> would incur a higher cost to provide that level of completion. The goal with this
> proposal would be to allow upper layers to set a less strict completion model,
> such as delivery or transmit complete, either as the default op_flag or as a per-
> operation flag, and to address persistence as a batch operation via the fi_commit
> API.
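>
> The usage model I have in mind looks roughly like the sketch below (fi_commit as
> defined later in this proposal; endpoint, buffer, address, and key names are
> placeholders):
>
> /* 1. Keep a cheaper default completion model for the bulk of the traffic. */
> info->tx_attr->op_flags = FI_TRANSMIT_COMPLETE;
>
> /* ... create the endpoint, then issue the normal stream of writes ... */
> for (i = 0; i < n; i++)
>         fi_write(ep, src[i], len[i], desc, dest_addr, raddr[i], rkey, NULL);
>
> /* 2. Later, persist everything written to the target region in one batch. */
> struct fi_rma_iov iov = { .addr = raddr_base, .len = region_len, .key = rkey };
> fi_commit(ep, &iov, 1, dest_addr, 0 /* flags reserved */, &commit_ctx);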
> 
> 
> >    > 	*	A single request to fi_commit should generate a control
> message to target
> >    > hardware or software emulation environment to flush the contents of
> memory targets.
> 
> >    This needs to be defined in terms of application level semantics, not
> implementation details.  fi_commit could be a no-op based on the provider
> implementation.  (It actually would be for the socket and tcp providers, which
> act at the target based on the MR flag.)
> 
> Completely agree. Rereading this proposal, I meant to change some of these
> discussion points away from implementation to a discussion on behavior and
> semantics. How fi_commit behaves with respect to implementation specifics isn't within
> the scope of this proposal. Implementation details are something I'd prefer to
> stay away from so we can define how we expect it to behave.
> 
> >     > flexibility in the API design to future proof against options we might not
> conceive of
> >    > until after the prototype is complete, and the context available for the user
> and
> >    > returned with the completion
> 
> >    The proposed definition is limited to RMA (and atomic) writes.  There is no
> mechanism for handling RMA reads into persistent memory, for example.  That
> should be included.  Message transfers may need a separate mechanism for this.
> That can be deferred (left undefined by the man pages), but ideally we should
> have an idea for how to support it.
> 
> >    The best existing API definition for an fi_commit call would be the
> fi_readmsg/fi_writemsg() calls.  We could even re-use those calls by adding a
> flag.
> 
> The proposed definition is limited to RMA and AMO because we didn't have a
> strong use case for messaging, but I'd like to go the route that allows messaging
> to be easily included if that changes down the road.
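>
> A rough sketch of the fi_writemsg reuse suggested above, assuming a hypothetical
> per-operation FI_COMMIT flag rather than a new entry point (names are
> placeholders):
>
> /* Hypothetical: reuse fi_writemsg() as the commit request by adding a flag.
>  * FI_COMMIT is not an existing libfabric flag; the rma_iov describes the
>  * remote region(s) to be committed, with no local payload. */
> struct fi_rma_iov rma_iov = { .addr = raddr, .len = region_len, .key = rkey };
> struct fi_msg_rma msg = {
>         .msg_iov = NULL, .iov_count = 0,
>         .addr = dest_addr,
>         .rma_iov = &rma_iov, .rma_iov_count = 1,
>         .context = &commit_ctx,
> };
> fi_writemsg(ep, &msg, FI_COMMIT | FI_COMMIT_COMPLETE);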
> 
> 
> >    > 	*	Since this API behaves like a data transfer API, it is expected that
> this
> >    > API would generate a completion event to the local completion queue
> associated with the
> >    > EP from which the transaction was initiated.
> 
> >    The generation of a *CQ* event makes sense.  We need to define if and how
> counters, locally and remote, are updated.  EQ events are not the right API
> match.
> 
> Agreed on the CQ aspect. As a note, EQs are not being discussed for the
> initiator, only the target, so I'll put my EQ comments in the next comment. As a
> general comment, I think that this could be a good candidate for discussion at
> the next OFIWG because it is a strange grey area to me.
> 
> >    > 	*	At the target, this should generate an event to the target's
> event queue –
> >    > if and only if the provider supports software emulated events. If a provider
> is capable
> >    > of hardware level commits to persistent memory, the transaction should be
> consumed
> >    > transparently by the hardware, and does not need to generate an event at
> the target.
> >    > This will require an additional event definition in libfabric (See definition for
> >    > fi_eq_commit_entry)
> 
> >    This too needs to be defined based on the application level semantics, not
> implementation.  The app should not be aware of implementation differences,
> except where mode bits dictate for performance reasons.  (And I can say that
> developers hate dealing with those differences, so we need to eliminate them.)
> 
> >    If we limit commit to RMA transfers, it makes sense for it to act as an RMA
> call for most purposes (i.e. fi_readmsg/fi_writemsg).  For example, the ability to
> carry CQ data and generate remote events (FI_RMA_EVENT) on the target CQ
> and counters.  We also need to consider if there's any impact on counters
> associated with the MR.
> 
> I agree that this needs to be defined in terms of application-level behavior.
> However, I do think we need to talk about if and how applications should be
> expected to facilitate the desired functionality if the hardware is not capable of
> it.  How a provider like sockets implements the functionality isn't
> important to define here, but if the provider needs the application to
> interact/configure in a specific way then I think that should be covered here. If
> there isn’t hardware support for FI_COMMIT_COMPLETE, then it seems to
> become a much more difficult problem. Libfabric could provide events to the
> application through EQ or CQ events, or go a similar route as HMEM is going
> now. I'd prefer to provide events to the application rather than attempt to
> support every PMEM library/hardware when handling the software emulation
> case.
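>
> A sketch of what that might look like at the target for the software-emulated
> path, using the fi_eq_commit_entry proposed below, a hypothetical FI_COMMIT
> event code, and libpmem's pmem_persist() as one possible flush mechanism:
>
> /* Target application drains the EQ and performs the flush itself when the
>  * provider cannot commit in hardware. FI_COMMIT as an EQ event code is
>  * hypothetical; pmem_persist() is from libpmem (<libpmem.h>). */
> uint32_t event;
> struct fi_eq_commit_entry entry;
> ssize_t rd = fi_eq_read(eq, &event, &entry, sizeof(entry), 0);
> if (rd > 0 && event == FI_COMMIT) {
>         for (size_t i = 0; i < entry.count; i++)
>                 pmem_persist((void *)(uintptr_t)entry.iov[i].addr,
>                              entry.iov[i].len);
>         /* the provider/application then acks the initiator so its
>          * FI_COMMIT_COMPLETE event can be generated */
> }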
> 
> >    > *	A new EQ event definition (fi_eq_commit_entry) to support software-
> emulated
> >    > persistence for devices that cannot provide hardware support
> >    >
> >    > 	*	The iov, and count variables mirror the original iov, and count
> contents of
> >    > the originating request.
> >    > 	*	The flags may be a diminished set of flags from the original
> transaction
> >    > under the assumption that only some flags would have meaning at the
> target and sending
> >    > originator-only flags to the target would have little value to the target
> process.
> 
> >    If any events are generated, they need to be CQ related, not EQ.
> 
> This is where I believe it becomes a grey area. I could see using FI_RMA_EVENT
> or something similar to provoke a CQ event generated at the target, but it
> doesn't feel like fi_commit is a data transfer operation. It seems like a control
> operation, which is another reason why it was defined as generating an EQ
> event; as a control operation, the commit/"flush" feels naturally aligned with EQ.
> 
> 
> >    > *	Additional flags or capabilities
> >    >
> >    > 	*	A provider should be able to indicate whether they support
> software
> >    > emulated notifications of fi_commit, or whether they can handle hardware
> requests for
> >    > commits to persistent memory
> 
> >    The implementation of hardware vs software should not be exposed.  Hybrid
> solutions (e.g. RxM or large transfers over verbs devices) are also possible.
> 
> If libfabric provides an event to the upper layer, I believe libfabric can support
> many more persistent memory models and devices by propagating events to the
> upper layer than if we attempt to put that capability into libfabric and support it
> transparently for the user. It's just my view, but application writers have asked
> us to optimize data transfers over the network with the abstraction we provide.
> This could be another complicated topic and we could discuss it at the
> next OFIWG.
> 
> 
> >     The FI_RMA_PMEM capability should be sufficient to indicate support for
> RMA reads and writes to persistent memory.  That should be an inclusive flag
> (along with the API version) indicating that all related operations are supported.
> 
> Something like this?
> 
> #define FI_PMEM  (FI_RMA_PMEM | FI_AMO_PMEM | FI_MSG_PMEM)
> 
> 
> >      Support for messaging requires additional definitions.  Part of the discussion
> is figuring out the scope of what should be defined in the short term.  As
> mentioned above, FI_FENCE | FI_COMMIT_COMPLETE can be used to commit
> message transfers.  I can't think of a better alternative here.  However, I'm not
> sure if the proposed IBTA and IETF specifications will result in hardware capable
> of supporting the FI_FENCE | FI_COMMIT_COMPLETE semantic.  :/
> 
> 
> Agreed on messaging, but it still lacks a good use case, so I haven't been as
> concerned.
> 
> I'm not yet convinced on FI_COMMIT_COMPLETE|FI_FENCE. If libfabric
> suggested the use of that, does that imply that providers must support 0-length
> sends and/or control messaging on behalf of the application? Does the data
> transfer itself provide any context to the region being flushed? What happens in
> the case of multiple persistent memory domains or devices? How would that
> data transfer provide the context necessary to flush a specific region, memory
> domain, or device? This seems more complicated than the initial suggestion
> indicates.
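>
> For concreteness, the existing-API flush idiom under discussion is roughly the
> following sketch (placeholder names throughout); my questions above are about
> what, if anything, such a transfer can scope the flush to:
>
> /* 0-length RMA write; FI_FENCE | FI_COMMIT_COMPLETE forces all prior
>  * transfers on this endpoint to reach the persistence domain before the
>  * completion for this operation is generated. */
> struct fi_rma_iov rma_iov = { .addr = raddr, .len = 0, .key = rkey };
> struct fi_msg_rma msg = {
>         .msg_iov = NULL, .iov_count = 0,
>         .addr = dest_addr,
>         .rma_iov = &rma_iov, .rma_iov_count = 1,
>         .context = &flush_ctx,
> };
> fi_writemsg(ep, &msg, FI_FENCE | FI_COMMIT_COMPLETE);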
> 
> >    > *	Addition of an event handler registration for handling event queue
> entries within
> >    > the provider context (See Definition: fi_eq_event_handler)
> >    >
> >    > 	*	Essentially, this becomes a registered callback for the target
> application
> >    > to handle specific event types. We can use this mechanism with the target
> application
> >    > to allow the provider to handle events internally using a function provided
> by the
> >    > application. The function would contain the logic necessary to handle the
> event
> 
> >    Callbacks are to be avoided.  They present difficult locking scenarios with
> severe restrictions on what the application can do from the callback, and
> present challenging object destruction situations.  Those restrictions can be
> difficult for an application to enforce, since calls outside the app to other
> libraries may violate them.
> 
> It's a good argument, and generally I feel the same way. What do you suggest as
> an alternative? Callbacks were suggested as a way for the provider to perform some
> behavior on behalf of the application upon receipt of the associated event.
> This would have allowed the provider to issue the commit/flush to the device and
> then return the ACK back to the initiator that the commit had succeeded/data
> was flushed as requested. Without a callback, I do not see a clean way for
> libfabric to coordinate flush and acknowledgement back to the initiator.
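>
> To make the callback route concrete, it would have looked roughly like the
> sketch below, using the fi_eq_register_handler()/fi_eq_event_handler_t
> definitions later in this proposal (FI_COMMIT as the event type and
> pmem_persist() from libpmem are assumptions):
>
> /* The handler flushes the requested ranges; its return code would let the
>  * provider send the commit acknowledgement back to the initiator. */
> static ssize_t commit_handler(struct fid_eq *eq, uint64_t event_type,
>                               void *event_data, uint64_t len, void *context)
> {
>         struct fi_eq_commit_entry *entry = event_data;
>
>         for (size_t i = 0; i < entry->count; i++)
>                 pmem_persist((void *)(uintptr_t)entry->iov[i].addr,
>                              entry->iov[i].len);
>         return 0;       /* success -> provider completes the commit to the peer */
> }
>
> /* registered once by the target application */
> fi_eq_register_handler(eq, FI_COMMIT, commit_handler, NULL);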
> 
> >    To be clear, the proposal only supports RMA writes, and maybe atomics, to
> the target memory.  That is likely sufficient for now, but I'd like to ensure that
> we have a way to extend pmem support beyond the limited use cases being
> discussed.
> 
> RMA and atomics -- with the intent not to exclude messaging. This is why the
> naming change from FI_RMA_PMEM to FI_PMEM was suggested.
> 
> 
> >    > 	*	Previous functionality allows for a commit for every message as
> is the case
> >    > for FI_COMMIT_COMPLETE, or the use of FI_COMMIT on a per-
> transaction basis. The need in
> >    >  ...
> >    > delivery model, and provides a mechanism to ensure that those data
> transfers are
> >    > eventually persisted.
> 
> >    Unless the app has set FI_COMMIT_COMPLETE as the default completion
> model, it only applies to the operation on which it was set.  The main gap I'm
> aware of with proposed specifications is support of a 'flush' type semantic.
> 
> The flush mechanism is the primary gap that the proposal is attempting to identify.
> However, I believe the software emulation elements of the proposal are
> valuable for prototyping efforts.
> 
> --
> James Swaro
> P: +1 (651) 605-9000
> 
> On 4/27/20, 9:38 PM, "Hefty, Sean" <sean.hefty at intel.com> wrote:
> 
>     Top-posting main discussion point.  Other comments further down:
> 
>     Conceptually, what's being proposed is specifying a data transfer as a 2-step
> process.
> 
>     1. identify the data source and target
>     2. specify the completion semantic
> 
>     Theoretically, the actual data transfer can occur any time after step 1 and
> before step 2 completes.  As an additional optimization, step 2 can apply to
> multiple step 1s.
> 
>     We need to decide:
> 
>     A. What completion semantic applies to step 1?
>     B. What operations do we support for step 1?
>     C. What completion semantics are supported for step 2?
> 
>     The current answers are:
> 
>     A. All completion levels are supported.  It's possible that none of them are
> desirable here, and we need to introduce a new mode:
> FI_UNDEFINED_COMPLETE.  This would indicate that the buffer cannot be re-
> used, and the data is not visible at the target, until a step 2 that covers the same
> target memory range completes.
> 
>     B. RMA reads and writes are supported.  It shouldn't be difficult to support
> atomics through the same APIs as well.  Message transfers are more difficult to
> specify in step 2, making them harder to support.
> 
>     C. The proposal only supports FI_COMMIT_COMPLETE.  Other levels could be
> added, though that may only make sense if we define something like
> FI_UNDEFINED_COMPLETE.
> 
>     I'm throwing FI_UNDEFINED_COMPLETE out for discussion.  There would be
> issues trying to define it, since data transfers issued at step 1 could generate
> completions locally and remotely prior to step 2 being invoked.  Those
> completions just wouldn't mean anything until step 2 completes.  The provider
> would select the best completion option for step 1.
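>
>     A rough sketch of the 2-step model, using the proposed fi_commit and the
>     FI_UNDEFINED_COMPLETE strawman (all names are placeholders):
>
>     /* step 1: post the transfers; completion level left undefined until step 2 */
>     fi_writemsg(ep, &msg, FI_UNDEFINED_COMPLETE);
>
>     /* step 2: commit; covers every step-1 transfer to the same target range */
>     struct fi_rma_iov iov = { .addr = raddr, .len = region_len, .key = rkey };
>     fi_commit(ep, &iov, 1, dest_addr, 0 /* flags reserved */, &ctx);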
> 
> 
>     > Libfabric requires modifications to support RMA and atomic operations
> targeted at
>     > remote memory registrations backed by persistent memory devices. These
> modifications
>     > should be made with the intent to drive support for persistent memory
> usage by
>     > applications that rely on communications middleware such as SHMEM in a
> manner that is
>     > consistent with byte-based/stream-based addressable memory formats.
> Existing proposals
>     > (the initial proposal) support NVMe/PMoF approaches, whereas this approach
> should also support
>     > flat memory and non-block-addressed memory structures and devices.
>     >
>     > Changes may be required in as many as three areas:
>     >
>     > *	Memory registration calls
>     >
>     > 	*	This allows a memory region to be registered as being capable
> of
>     > persistence. This has already been introduced into the upstream libfabric
> GITHUB, but
>     > should be reviewed to ensure it matches use case requirements.
> 
>     FI_RMA_PMEM is defined as an MR flag.  Note that this definition intentionally
> limits non-RMA transfers from taking advantage of persistent memory
> semantics.
> 
>     The intent of this flag is to give providers implementation flexibility,
> specifically based on hardware/software differences.
> 
> 
>     > *	Completion semantics
>     >
>     > 	*	These changes allow a completion event or notification to be
> deferred until
>     > the referenced data has reached the persistence domain at the target. This
> has already
>     > been introduced into the upstream libfabric GITHUB, but should be reviewed
> to ensure it
>     > matches use case requirements.
> 
>     Completion semantics may be adjusted on a per-transfer basis.  The
> FI_COMMIT_COMPLETE semantic applies to both the initiator and target.
> Completion semantics are a minimal guarantee from a provider.  The provider
> can do more.
> 
>     > *	Consumer control of persistence
>     >
>     > 	*	As presently implemented in the upstream libfabric GITHUB,
> persistence is
>     > determined on a transaction-by-transaction basis. It was acknowledged at
> the time that
>     > this is a simplistic implementation. We need to reach consensus on the
> following:
>     >
>     > 		*	Should persistence be signaled on the basis of the
> target memory
>     > region? For example, one can imagine a scheme where data targeted at a
> particular
>     > memory region is automatically pushed into the persistence domain by the
> target,
>     > obviating the need for any sort of commit operation.
> 
>     In cases where a commit operation is not needed, it can become a no-op, but
> it may be required functionality for some providers.
> 
> 
>     > 		*	Is an explicit 'commit' operation of some type required,
> and if so,
>     > what is the scope of that commit operation? Is there a persistence fence
> defined such
>     > that every operation prior to the fence is made persistent by a commit
> operation?
> 
>     With the current API, persistence can be achieved by issuing a 0-length RMA
> with FI_COMMIT_COMPLETE | FI_FENCE semantics.  The fence requires that
> *all* prior transfers over that endpoint meet the requested completion
> semantic.
> 
>     This may not be ideal, but may be the best way to handle message transfers to
> persistent memory.
> 
> 
>     > Proposal
>     >
>     > The experimental work in the OFIWG/libfabric branch is sufficient for the
> needs of
>     > SHMEM, with exception to the granularity of event generation. When the
> current
>     > implementation generates events, it would generate commit-level
> completion events with
>     > every operation. That type of operation would make the delivery of
> completion events
>     > take longer than necessary for most operations, so SHMEM would need
> finer control over
>     > commit flushing behavior.
> 
>     OFI does not require that an event be generated for every transfer.  It also
> allows transfers to report completions using 'lower' completion semantics, such
> as FI_TRANSMIT_COMPLETE.  Completion events at the target of an RMA write
> require the FI_RMA_EVENT capability, and are independent of PMEM.
> 
>     > To satisfy this, the following is being proposed:
>     >
>     > *	A new API: fi_commit (See definitions: fi_commit)
>     > 	The new API would be used to generate a commit instruction to a target
> peer. The
>     > instruction would be defined by a set of memory registration keys, or
> regions by which
>     > the target could issue a commit to persistent memory.
> 
>     See discussion at the top.
> 
> 
>     > 	*	A single request to fi_commit should generate a control
> message to target
>     > hardware or software emulation environment to flush the contents of
> memory targets.
> 
>     This needs to be defined in terms of application level semantics, not
> implementation details.  fi_commit could be a no-op based on the provider
> implementation.  (It actually would be for the socket and tcp providers, which
> act at the target based on the MR flag.)
> 
>     > Memory targets are defined by the iov structures and key fields, and the
> number of
> memory targets is defined by the count field. The destination address is
> handled by
>     > the dest_addr field. The flags field is held reserved at this time to allow for
>     > flexibility in the API design to future-proof against options we might not
> conceive of
>     > until after the prototype is complete. The context is available for the user
> and is
>     > returned with the completion.
> 
>     The proposed definition is limited to RMA (and atomic) writes.  There is no
> mechanism for handling RMA reads into persistent memory, for example.  That
> should be included.  Message transfers may need a separate mechanism for this.
> That can be deferred (left undefined by the man pages), but ideally we should
> have an idea for how to support it.
> 
>     The best existing API definition for an fi_commit call would be the
> fi_readmsg/fi_writemsg() calls.  We could even re-use those calls by adding a
> flag.
> 
>     > 	*	Since this API behaves like a data transfer API, it is expected that
> this
>     > API would generate a completion event to the local completion queue
> associated with the
>     > EP from which the transaction was initiated.
> 
>     The generation of a *CQ* event makes sense.  We need to define if and how
> counters, locally and remote, are updated.  EQ events are not the right API
> match.
> 
> 
>     > 	*	At the target, this should generate an event to the target's
> event queue –
>     > if and only if the provider supports software emulated events. If a provider is
> capable
>     > of hardware level commits to persistent memory, the transaction should be
> consumed
>     > transparently by the hardware, and does not need to generate an event at
> the target.
>     > This will require an additional event definition in libfabric (See definition for
>     > fi_eq_commit_entry)
> 
>     This too needs to be defined based on the application level semantics, not
> implementation.  The app should not be aware of implementation differences,
> except where mode bits dictate for performance reasons.  (And I can say that
> developers hate dealing with those differences, so we need to eliminate them.)
> 
>     If we limit commit to RMA transfers, it makes sense for it to act as an RMA
> call for most purposes (i.e. fi_readmsg/fi_writemsg).  For example, the ability to
> carry CQ data and generate remote events (FI_RMA_EVENT) on the target CQ
> and counters.  We also need to consider if there's any impact on counters
> associated with the MR.
> 
> 
>     > *	A new EQ event definition (fi_eq_commit_entry) to support software-
> emulated
>     > persistence for devices that cannot provide hardware support
>     >
>     > 	*	The iov, and count variables mirror the original iov, and count
> contents of
>     > the originating request.
>     > 	*	The flags may be a diminished set of flags from the original
> transaction
>     > under the assumption that only some flags would have meaning at the
> target and sending
>     > originator-only flags to the target would have little value to the target
> process.
> 
>     If any events are generated, they need to be CQ related, not EQ.
> 
> 
>     > *	Additional flags or capabilities
>     >
>     > 	*	A provider should be able to indicate whether they support
> software
>     > emulated notifications of fi_commit, or whether they can handle hardware
> requests for
>     > commits to persistent memory
> 
>     The implementation of hardware vs software should not be exposed.  Hybrid
> solutions (e.g. RxM or large transfers over verbs devices) are also possible.
> 
> 
>     > 		*	An additional flag should be introduced to the fi_info
> structure
>     > under modes: FI_COMMIT_MANUAL (or something else)
> 
>     The FI_RMA_PMEM capability should be sufficient to indicate support for RMA
> reads and writes to persistent memory.  That should be an inclusive flag (along
> with the API version) indicating that all related operations are supported.
> 
> 
>     > 			*	This flag would indicate to the application that
> events may be
>     > generated to the event queue for consumption by the application. Commit
> events would be
>     > generated upon receipt of a commit message from a remote peer, and the
> application
>     > would be responsible for handling the event.
>     > 			*	Lack of the FI_COMMIT_MANUAL flag, and the
> presence of the
>     > FI_RMA_PMEM (or FI_PMEM) flag in the info structure should imply that the
> hardware is
>     > capable of handling the commit requests to persistent memory and the
> application does
>     > not need to read the event queue for commit events.
>     >
>     > *	Change of flag definition
>     >
>     > 	*	The FI_RMA_PMEM flag should be changed to FI_PMEM to
> indicate that the
>     > provider is PMEM aware, and supports RMA/AMO/MSG operations to and
> from persistent
>     > memory.
>     > 	*	There may be little value in supporting messaging interfaces,
> but it is
> something that could be supported.
> 
>     Support for messaging requires additional definitions.  Part of the discussion is
> figuring out the scope of what should be defined in the short term.  As
> mentioned above, FI_FENCE | FI_COMMIT_COMPLETE can be used to commit
> message transfers.  I can't think of a better alternative here.  However, I'm not
> sure if the proposed IBTA and IETF specifications will result in hardware capable
> of supporting the FI_FENCE | FI_COMMIT_COMPLETE semantic.  :/
> 
> 
>     > *	Addition of an event handler registration for handling event queue
> entries within
>     > the provider context (See Definition: fi_eq_event_handler)
>     >
>     > 	*	Essentially, this becomes a registered callback for the target
> application
>     > to handle specific event types. We can use this mechanism with the target
> application
>     > to allow the provider to handle events internally using a function provided by
> the
>     > application. The function would contain the logic necessary to handle the
> event
> 
>     Callbacks are to be avoided.  They present difficult locking scenarios with
> severe restrictions on what the application can do from the callback, and
> present challenging object destruction situations.  Those restrictions can be
> difficult for an application to enforce, since calls outside the app to other
> libraries may violate them.
> 
> 
>     > 	*	Specific to PMEM, a function handler would be used by the
> target
>     > application to handle commits to persistent memory as they were delivered
> without
>     > requiring a fi_eq_read and some form of acknowledgement around the
> commit action. With
>     > the handler, the commit could be handled entirely by the function provided
> by the
>     > application, and the return code from the application provided call-back
> would be
>     > sufficient for a software emulation in the provider to produce the return
> message to
>     > the sender that the commit transaction is fully complete. The use of a
> handler allows
> us to make the commit transaction as light-weight or heavy-weight as
> necessary.
>     >
>     > Definitions:
>     >
>     > fi_commit
>     >
>     > ssize_t fi_commit(struct fid_ep *ep,
>     >                   const struct fi_rma_iov *iov,
>     >                   size_t count,
>     >                   fi_addr_t dest_addr,
>     >                   uint64_t flags,
>     >                   void *context);
>     >
>     > fi_eq_commit_entry
>     >
>     > struct fi_eq_commit_entry {
>     >     fid_t                    fid;    /* fid associated with request */
>     >     const struct fi_rma_iov *iov;    /* iovec of memory regions to be committed to persistent memory */
>     >     size_t                   count;  /* number of iovec/key entries */
>     >     uint64_t                 flags;  /* operation-specific flags */
>     > };
>     >
>     > fi_eq_event_handler
>     >
>     > typedef ssize_t (*fi_eq_event_handler_t)(struct fid_eq *eq,
>     >     uint64_t event_type,
>     >     void *event_data,
>     >     uint64_t len,
>     >     void *context);
>     >
>     > ssize_t fi_eq_register_handler(struct fid_eq *eq,
>     >     uint64_t event_type,
>     >     fi_eq_event_handler_t handler,
>     >     void *context);
>     >
>     > Use cases supported by this proposal:
>     >
>     > *	As an application writer, I need to commit multiple previously-sent data
>     > transfers to the persistence domain
> 
>     To be clear, the proposal only supports RMA writes, and maybe atomics, to
> the target memory.  That is likely sufficient for now, but I'd like to ensure that
> we have a way to extend pmem support beyond the limited use cases being
> discussed.
> 
> 
>     > 	*	Previous functionality allows for a commit for every message as
> is the case
>     > for FI_COMMIT_COMPLETE, or the use of FI_COMMIT on a per-transaction
> basis. The need in
>     > this use case is performance-oriented, to allow a less strict delivery model to
> the NIC
>     > for most messages followed up with a 'flush' of the NIC to the persistence
> domain. This
>     > allows most messages targeted to the persistence domain to complete with
> a less strict
>     > delivery model, and provides a mechanism to ensure that those data
> transfers are
>     > eventually persisted.
> 
>     Unless the app has set FI_COMMIT_COMPLETE as the default completion
> model, it only applies to the operation on which it was set.  The main gap I'm
> aware of with proposed specifications is support of a 'flush' type semantic.
> 
> 
>     - Sean
> 
> 

