[ofiwg] Proposal for enhancement to support additional Persistent Memory use cases (ofiwg/libfabric#5874)
Douglas, Chet R
chet.r.douglas at intel.com
Mon May 4 07:44:23 PDT 2020
I understand. But I don’t think we should move forward with any pmem additions until we at least talk about it. Have the IBTA and IETF drafts been taken into account in what’s being proposed?
-----Original Message-----
From: Grun, Paul <paul.grun at hpe.com>
Sent: Friday, May 01, 2020 12:14 PM
To: Douglas, Chet R <chet.r.douglas at intel.com>; Rupert Dance - SFI <rsdance at soft-forge.com>; Swaro, James E <james.swaro at hpe.com>; Hefty, Sean <sean.hefty at intel.com>; ofiwg at lists.openfabrics.org
Subject: RE: [ofiwg] Proposal for enhancement to support additional Persistent Memory use cases (ofiwg/libfabric#5874)
Keep in mind that the libfabric API doesn't necessarily directly mimic what is implemented in verbs. The requirement is that the verbs semantics are implementable via a libfabric implementation. I think of libfabric as being a slightly more abstract interface than verbs, hence the libfabric APIs don't necessarily expose the gritty details described in the current IBTA Annex.
> -----Original Message-----
> From: ofiwg <ofiwg-bounces at lists.openfabrics.org> On Behalf Of
> Douglas, Chet R
> Sent: Friday, May 1, 2020 7:51 AM
> To: Rupert Dance - SFI <rsdance at soft-forge.com>; Swaro, James E
> <james.swaro at hpe.com>; Hefty, Sean <sean.hefty at intel.com>;
> ofiwg at lists.openfabrics.org
> Subject: Re: [ofiwg] Proposal for enhancement to support additional
> Persistent Memory use cases (ofiwg/libfabric#5874)
>
> It matters! Eventually (now?) we want full RDMA extension support in
> libfabric, libibverbs, and the verbs spec. This appears to be based
> on Intel's original libfabric proposal? Commit is not a valid term.
> Complete RDMA memory placement extension support looks different than it did in that original proposal.
> We need to architect the complete solution. Don’t we? Does it
> support RDMA Flush, Write Atomic and Verify? How do you register
> cached vs uncached pmem? Is this already in the wild? If not, we
> shouldn’t release it without further consideration.
>
> -----Original Message-----
> From: ofiwg <ofiwg-bounces at lists.openfabrics.org> On Behalf Of Rupert
> Dance
> - SFI
> Sent: Friday, May 01, 2020 8:31 AM
> To: 'Swaro, James E' <james.swaro at hpe.com>; Hefty, Sean
> <sean.hefty at intel.com>; ofiwg at lists.openfabrics.org
> Subject: Re: [ofiwg] Proposal for enhancement to support additional
> Persistent Memory use cases (ofiwg/libfabric#5874)
>
> Is this team aware of what the IBTA is doing with PME or does it not
> matter since it is libfabric?
>
> -----Original Message-----
> From: ofiwg <ofiwg-bounces at lists.openfabrics.org> On Behalf Of Swaro,
> James E
> Sent: Friday, May 01, 2020 9:41 AM
> To: Hefty, Sean <sean.hefty at intel.com>; ofiwg at lists.openfabrics.org
> Subject: Re: [ofiwg] Proposal for enhancement to support additional
> Persistent Memory use cases (ofiwg/libfabric#5874)
>
> > > *This allows a memory region to be registered as being capable of
> > > persistence. This has already been introduced into the upstream
> > > libfabric GitHub, but should be reviewed to ensure it matches use
> > > case requirements.
>
> > FI_RMA_PMEM is defined as a MR flag. Note that this definition
> > intentionally prevents non-RMA transfers from taking advantage of
> > persistent memory semantics.
>
> > The intent of this flag is to give providers implementation
> > flexibility, specifically based on hardware/software differences.
>
> Understood. The intent of this section of the proposal was to outline
> potential areas for change. Any questions posed here were historical
> and meant to provoke discussion. They might even be a little dated.
> Those changes and the rationale are discussed below.
>
>
> > > every operation. That type of operation would make the delivery of
> > > completion events take longer than necessary for most operations,
> > > so SHMEM would need finer control over commit flushing behavior.
>
> > OFI does not require that an event be generated for every transfer.
> > It also allows transfers to report completions using 'lower'
> > completion semantics, such as FI_TRANSMIT_COMPLETE. Completion events
> > at the target of an RMA write require the FI_RMA_EVENT capability and
> > are independent of PMEM.
>
> Understood. This paragraph was intended to address a complication that
> was raised in one of the meetings.
>
> It was discussed that with some applications, all or most data would
> be required to be persistent. The solution at the time was to provide
> FI_COMMIT_COMPLETE as part of the default TX op_flags, which would
> incur a higher cost to provide that level of completion. The goal with
> this proposal would be to allow upper layers to set a less strict
> completion model, such as delivery or transmit complete, as part of
> the default op_flag or per-operation flags, and to address persistence
> as a batch operation via the fi_commit API.
>
>
> > > *A single request to fi_commit should generate a control message
> > > to target hardware or software emulation environment to flush the
> > > contents of memory targets.
>
> > This needs to be defined in terms of application level semantics, not
> > implementation details. fi_commit could be a no-op based on the
> > provider implementation. (It actually would be for the socket and tcp
> > providers, which act at the target based on the MR flag.)
>
> Completely agree. Rereading this proposal, I meant to change some of
> these discussion points away from implementation toward a discussion
> of behavior and semantics. How fi_commit behaves with respect to
> implementation specifics isn't within the scope of this proposal.
> Implementation details are something I'd prefer to stay away from so
> we can define how we expect it to behave.
>
> > > flexibility in the API design to future proof against options we
> > > might not conceive of until after the prototype is complete, and
> > > the context available for the user and returned with the completion
>
> > The proposed definition is limited to RMA (and atomic) writes. There
> > is no mechanism for handling RMA reads into persistent memory, for
> > example. That should be included. Message transfers may need a
> > separate mechanism for this. That can be deferred (left undefined by
> > the man pages), but ideally we should have an idea for how to support
> > it.
>
> > The best existing API definition for an fi_commit call would be the
> > fi_readmsg/fi_writemsg() calls. We could even re-use those calls by
> > adding a flag.
>
> The proposed definition is limited to RMA and AMO because we didn't
> have a strong use case for messaging, but I'd like to go the route
> that allows messaging to be easily included if that changes down the
> road.
>
>
> > > *Since this API behaves like a data transfer API, it is expected
> > > that this API would generate a completion event to the local
> > > completion queue associated with the EP from which the transaction
> > > was initiated.
>
> > The generation of a *CQ* event makes sense. We need to define if and
> > how counters, locally and remotely, are updated. EQ events are not
> > the right API match.
>
> Agreed on the CQ aspect. As a note, EQs are not being discussed for
> the initiator, only the target, so I'll put my EQ comments in the next
> comment. As a general comment, I think that this could be a good
> candidate for discussion at the next OFIWG because it is a strange grey area to me.
>
> > > *At the target, this should generate an event to the target's
> > > event queue – if and only if the provider supports software
> > > emulated events. If a provider is capable of hardware level commits
> > > to persistent memory, the transaction should be consumed
> > > transparently by the hardware, and does not need to generate an
> > > event at the target. This will require an additional event
> > > definition in libfabric (See definition for fi_eq_commit_entry)
>
> > This too needs to be defined based on the application level
> > semantics, not implementation. The app should not be aware of
> > implementation differences, except where mode bits dictate for
> > performance reasons. (And I can say that developers hate dealing with
> > those differences, so we need to eliminate them.)
>
> > If we limit commit to RMA transfers, it makes sense for it to act as
> > an RMA call for most purposes (i.e. fi_readmsg/fi_writemsg). For
> > example, the ability to carry CQ data and generate remote events
> > (FI_RMA_EVENTS) on the target CQ and counters. We also need to
> > consider if there's any impact on counters associated with the MR.
>
> I agree that this needs to be defined in terms of application-level
> behavior. However, I do think we need to talk about if and how
> applications should be expected to facilitate the desired
> functionality if the hardware is not capable of it. How a provider
> like sockets implements the functionality isn't important to define
> here, but if the provider needs the application to interact or
> configure things in a specific way, then I think that should be
> covered here. If there isn’t hardware support for FI_COMMIT_COMPLETE,
> then it seems to become a much more difficult problem. Libfabric could
> provide events to the application through EQ or CQ events, or go a
> similar route as HMEM is going now. I'd prefer to provide events to
> the application rather than attempt to support every PMEM
> library/hardware when handling the software emulation case.
>
> > > *A new EQ event definition (fi_eq_commit_entry) to support
> > > software-emulated persistence for devices that cannot provide
> > > hardware support
> > >
> > > *The iov, and count variables mirror the original iov, and count
> > > contents of the originating request.
> > > *The flags may be a diminished set of flags from the original
> > > transaction under the assumption that only some flags would have
> > > meaning at the target and sending originator-only flags to the
> > > target would have little value to the target process.
>
> > If any events are generated, they need to be CQ related, not EQ.
>
> This is where I believe it becomes a grey area. I could see using
> FI_RMA_EVENT or something similar to provoke a CQ event generated at
> the target, but fi_commit doesn't feel like a data transfer operation.
> The commit/"flush" seems like a control operation, which is another
> reason why it was defined as generating an EQ event.
>
>
> > > *Additional flags or capabilities
> > >
> > > *A provider should be able to indicate whether they support
> > > software emulated notifications of fi_commit, or whether they can
> > > handle hardware requests for commits to persistent memory
>
> > The implementation of hardware vs software should not be exposed.
> > Hybrid solutions (e.g. RxM or large transfers over verbs devices) are
> > also possible.
>
> If libfabric provides an event to the upper layer, I believe libfabric
> can support many more persistent memory models and devices by
> propagating events to the upper layer than if we attempt to put that
> capability into libfabric and support it transparently for the user.
> It's just my view, but I think application writers have asked us to
> optimize data transfers over the network with the abstraction we
> provide. This could be another complicated topic and we could discuss
> it at the next OFIWG.
>
>
> > The FI_RMA_PMEM capability should be sufficient to indicate support
> > for RMA reads and writes to persistent memory. That should be an
> > inclusive flag (along with the API version) indicating that all
> > related operations are supported.
>
> Something like this?
>
> #define FI_PMEM (FI_RMA_PMEM | FI_AMO_PMEM | FI_MSG_PMEM)
>
>
> > Support for messaging requires additional definitions. Part of the
> > discussion is figuring out the scope of what should be defined in the
> > short term. As mentioned above, FI_FENCE | FI_COMMIT_COMPLETE can be
> > used to commit message transfers. I can't think of a better
> > alternative here. However, I'm not sure if the proposed IBTA and IETF
> > specifications will result in hardware capable of supporting the
> > FI_FENCE | FI_COMMIT_COMPLETE semantic. :/
>
>
> Agreed on messaging, but it lacks a good use case yet so I haven't
> been as concerned.
>
> I'm not yet convinced on FI_COMMIT_COMPLETE|FI_FENCE. If libfabric
> suggested the use of that, does that imply that providers must support
> 0-length sends and/or control messaging on behalf of the application?
> Does the data transfer itself provide any context to the region being
> flushed? What happens in the case of multiple persistent memory
> domains or devices? How would that data transfer provide the context
> necessary to flush a specific region, memory domain, or device? This
> seems more complicated than the initial suggestion indicates.
>
> > > *Addition of an event handler registration for handling event
> > > queue entries within the provider context (See Definition:
> > > fi_eq_event_handler)
> > >
> > > *Essentially, this becomes a registered callback for the target
> > > application to handle specific event types. We can use this
> > > mechanism with the target application to allow the provider to
> > > handle events internally using a function provided by the
> > > application. The function would contain the logic necessary to
> > > handle the event
>
> > Callbacks are to be avoided. They present difficult locking scenarios
> > with severe restrictions on what the application can do from the
> > callback, and present challenging object destruction situations.
> > Those restrictions can be difficult for an application to enforce,
> > since calls outside the app to other libraries may violate them.
>
> It's a good argument, and generally I feel the same way. What do you
> suggest as an alternative? Callbacks were suggested as a way for the
> provider to perform some behavior on behalf of the application upon
> receipt of the associated event. This would have allowed the provider
> to issue the commit/flush to the device and then return the ACK back
> to the initiator that the commit had succeeded/data was flushed as
> requested. Without a callback, I do not see a clean way for libfabric
> to coordinate the flush and acknowledgement back to the initiator.
>
> > To be clear, the proposal only supports RMA writes, and maybe
> > atomics, to the target memory. That is likely sufficient for now, but
> > I'd like to ensure that we have a way to extend pmem support beyond
> > the limited use cases being discussed.
>
> RMA, and atomics -- with the intent not to exclude messaging. This is
> why the naming change from FI_RMA_PMEM to FI_PMEM was suggested.
>
>
> > > *Previous functionality allows for a commit for every message as
> > > is the case for FI_COMMIT_COMPLETE, or the use of FI_COMMIT on a
> > > per-transaction basis. The need in
> > > ...
> > > delivery model, and provides a mechanism to ensure that those data
> > > transfers are eventually persisted.
>
> > Unless the app has set FI_COMMIT_COMPLETE as the default completion
> > model, it only applies to the operation on which it was set. The main
> > gap I'm aware of with proposed specifications is support of a 'flush'
> > type semantic.
>
> The flush mechanic is the primary gap that the proposal is attempting to identify.
> However, I believe the software emulation elements of the proposal are
> valuable for prototyping efforts.
>
> --
> James Swaro
> P: +1 (651) 605-9000
>
> On 4/27/20, 9:38 PM, "Hefty, Sean" <sean.hefty at intel.com> wrote:
>
> Top-posting main discussion point. Other comments further down:
>
> Conceptually, what's being proposed is specifying a data transfer
> as a 2-step process.
>
> 1. identify the data source and target
> 2. specify the completion semantic
>
> Theoretically, the actual data transfer can occur any time after
> step 1 and before step 2 completes. As an additional optimization,
> step 2 can apply to multiple step 1s.
>
> We need to decide:
>
> A. What completion semantic applies to step 1?
> B. What operations do we support for step 1?
> C. What completion semantics are supported for step 2?
>
> The current answers are:
>
> A. All completion levels are supported. It's possible that none
> of them are desirable here, and we need to introduce a new mode:
> FI_UNDEFINED_COMPLETE. This would indicate that the buffer cannot be
re-used, and the data is not visible at the target, until step 2
> completes that covers the same target memory range.
>
> B. RMA reads and writes are supported. It shouldn't be difficult
> to support atomics through the same APIs as well. Message transfers
> are more difficult to specify in step 2, making them harder to support.
>
> C. The proposal only supports FI_COMMIT_COMPLETE. Other levels
> could be added, though that may only make sense if we define something
> like FI_UNDEFINED_COMPLETE.
>
> I'm throwing FI_UNDEFINED_COMPLETE out for discussion. There
> would be issues trying to define it, since data transfers issued at
> step 1 could generate completions locally and remotely prior to step 2
> being invoked. Those completions just wouldn't mean anything until
> step 2 completes. The provider would select the best completion option for step 1.
>
>
> > Libfabric requires modifications to support RMA and atomic
> > operations targeted at remote memory registrations backed by
> > persistent memory devices. These modifications should be made with
> > the intent to drive support for persistent memory usage by
> > applications that rely on communications middleware such as SHMEM in
> > a manner that is consistent with byte-based/stream-based addressable
> > memory formats. Existing proposals (initial proposal) support
> > NVMe/PMoF approaches, whereas this approach should support flat
> > memory, non-block addressed memory structures and devices.
> >
> > Changes may be required in as many as three areas:
> >
> > *Memory registration calls
> >
> > *This allows a memory region to be registered as being capable of
> > persistence. This has already been introduced into the upstream
> > libfabric GitHub, but should be reviewed to ensure it matches use
> > case requirements.
>
> FI_RMA_PMEM is defined as a MR flag. Note that this definition
> intentionally prevents non-RMA transfers from taking advantage of
> persistent memory semantics.
>
> The intent of this flag is to give providers implementation
> flexibility, specifically based on hardware/software differences.
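For reference, registering a persistent-memory region with that existing MR flag looks roughly like the fragment below (error handling elided; `domain`, `pmem_buf`, and `len` are assumed to have been set up earlier):

```c
struct fid_mr *mr;
int ret;

/* FI_RMA_PMEM as an MR flag: the region is advertised as capable of
 * persistence, so remote writes with FI_COMMIT_COMPLETE can target it. */
ret = fi_mr_reg(domain, pmem_buf, len,
                FI_REMOTE_WRITE | FI_REMOTE_READ,
                0 /* offset */, 0 /* requested key */,
                FI_RMA_PMEM, &mr, NULL);
```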
>
>
> > *Completion semantics
> >
> > *These changes allow a completion event or notification to be
> > deferred until the referenced data has reached the persistence domain
> > at the target. This has already been introduced into the upstream
> > libfabric GitHub, but should be reviewed to ensure it matches use
> > case requirements.
>
> Completion semantics may be adjusted on a per transfer basis. The
> FI_COMMIT_COMPLETE semantic applies to both the initiator and target.
> Completion semantics are a minimal guarantee from a provider. The
> provider can do more.
>
> > *Consumer control of persistence
> >
> > *As presently implemented in the upstream libfabric GitHub,
> > persistence is determined on a transaction-by-transaction basis. It
> > was acknowledged at the time that this is a simplistic
> > implementation. We need to reach consensus on the following:
> >
> > *Should persistence be signaled on the basis of the target memory
> > region? For example, one can imagine a scheme where data targeted at
> > a particular memory region is automatically pushed into the
> > persistence domain by the target, obviating the need for any sort of
> > commit operation.
>
> In cases where a commit operation is not needed, it can become a
> no-op, but it may be required functionality for some providers.
>
>
> > *Is an explicit 'commit' operation of some type required, and if so,
> > what is the scope of that commit operation? Is there a persistence
> > fence defined such that every operation prior to the fence is made
> > persistent by a commit operation?
>
> With the current API, persistence can be achieved by issuing a
> 0-length RMA with FI_COMMIT_COMPLETE | FI_FENCE semantics. The fence
> requires that
> *all* prior transfers over that endpoint meet the requested completion
> semantic.
>
> This may not be ideal, but may be the best way to handle message
> transfers to persistent memory.
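Concretely, that flush idiom with the current API might be sketched as the fragment below, assuming `ep`, `dest_addr`, a remote key `rkey`, and a registered remote address `remote_addr` are already established (`flush_ctx` is a hypothetical completion context):

```c
struct fi_rma_iov rma_iov = {
    .addr = remote_addr,   /* any address within the registered region */
    .len  = 0,             /* 0-length: no data moved, fence only */
    .key  = rkey,
};
struct fi_msg_rma msg = {
    .msg_iov       = NULL,
    .desc          = NULL,
    .iov_count     = 0,
    .addr          = dest_addr,
    .rma_iov       = &rma_iov,
    .rma_iov_count = 1,
    .context       = &flush_ctx,
    .data          = 0,
};

/* FI_FENCE orders this behind all prior transfers on the endpoint;
 * FI_COMMIT_COMPLETE delays its completion until the fenced data has
 * reached the persistence domain at the target. */
ret = fi_writemsg(ep, &msg, FI_COMMIT_COMPLETE | FI_FENCE);
```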
>
>
> > Proposal
> >
> > The experimental work in the OFIWG/libfabric branch is sufficient
> > for the needs of SHMEM, with exception to the granularity of event
> > generation. When the current implementation generates events, it
> > would generate commit-level completion events with every operation.
> > That type of operation would make the delivery of completion events
> > take longer than necessary for most operations, so SHMEM would need
> > finer control over commit flushing behavior.
>
> OFI does not require that an event be generated for every transfer.
> It also allows transfers to report completions using 'lower'
> completion semantics, such as FI_TRANSMIT_COMPLETE. Completion events
> at the target of an RMA write require the FI_RMA_EVENT capability and
> are independent of PMEM.
>
> > To satisfy this, the following is being proposed:
> >
> > *A new API: fi_commit (See definitions: fi_commit)
> > The new API would be used to generate a commit instruction to a
> > target peer. The instruction would be defined by a set of memory
> > registration keys, or regions by which the target could issue a
> > commit to persistent memory.
>
> See discussion at the top.
>
>
> > *A single request to fi_commit should generate a control message to
> > target hardware or software emulation environment to flush the
> > contents of memory targets.
>
> This needs to be defined in terms of application level semantics, not
> implementation details. fi_commit could be a no-op based on the
> provider implementation. (It actually would be for the socket and tcp
> providers, which act at the target based on the MR flag.)
>
> > Memory targets are defined by the iov structures and key fields, and
> > the number of memory targets is defined by the count field. The
> > destination address is handled by the dest_addr field. The flags
> > field is held reserved at this time to allow for flexibility in the
> > API design, to future proof against options we might not conceive of
> > until after the prototype is complete, and the context is available
> > for the user and returned with the completion
>
> The proposed definition is limited to RMA (and atomic) writes. There
> is no mechanism for handling RMA reads into persistent memory, for
> example. That should be included. Message transfers may need a
> separate mechanism for this. That can be deferred (left undefined by
> the man pages), but ideally we should have an idea for how to support
> it.
>
> The best existing API definition for an fi_commit call would be the
> fi_readmsg/fi_writemsg() calls. We could even re-use those calls by
> adding a flag.
>
> > *Since this API behaves like a data transfer API, it is expected
> > that this API would generate a completion event to the local
> > completion queue associated with the EP from which the transaction
> > was initiated.
>
> The generation of a *CQ* event makes sense. We need to define if and
> how counters, locally and remotely, are updated. EQ events are not the
> right API match.
>
>
> > *At the target, this should generate an event to the target's event
> > queue – if and only if the provider supports software emulated
> > events. If a provider is capable of hardware level commits to
> > persistent memory, the transaction should be consumed transparently
> > by the hardware, and does not need to generate an event at the
> > target. This will require an additional event definition in libfabric
> > (See definition for fi_eq_commit_entry)
>
> This too needs to be defined based on the application level
> semantics, not implementation. The app should not be aware of
> implementation differences, except where mode bits dictate for
> performance reasons. (And I can say that developers hate dealing with
> those differences, so we need to eliminate them.)
>
> If we limit commit to RMA transfers, it makes sense for it to act as
> an RMA call for most purposes (i.e. fi_readmsg/fi_writemsg). For
> example, the ability to carry CQ data and generate remote events
> (FI_RMA_EVENTS) on the target CQ and counters. We also need to
> consider if there's any impact on counters associated with the MR.
>
>
> > *A new EQ event definition (fi_eq_commit_entry) to support
> > software-emulated persistence for devices that cannot provide
> > hardware support
> >
> > *The iov, and count variables mirror the original iov, and count
> > contents of the originating request.
> > *The flags may be a diminished set of flags from the original
> > transaction under the assumption that only some flags would have
> > meaning at the target and sending originator-only flags to the
> > target would have little value to the target process.
>
> If any events are generated, they need to be CQ related, not EQ.
>
>
> > *Additional flags or capabilities
> >
> > *A provider should be able to indicate whether they support software
> > emulated notifications of fi_commit, or whether they can handle
> > hardware requests for commits to persistent memory
>
> The implementation of hardware vs software should not be exposed.
> Hybrid solutions (e.g. RxM or large transfers over verbs devices) are
> also possible.
>
>
> > *An additional flag should be introduced to the fi_info structure
> > under modes: FI_COMMIT_MANUAL (or something else)
>
> The FI_RMA_PMEM capability should be sufficient to indicate support
> for RMA reads and writes to persistent memory. That should be an
> inclusive flag (along with the API version) indicating that all
> related operations are supported.
>
>
> > *This flag would indicate to the application that events may be
> > generated to the event queue for consumption by the application.
> > Commit events would be generated upon receipt of a commit message
> > from a remote peer, and the application would be responsible for
> > handling the event.
> > *Lack of the FI_COMMIT_MANUAL flag, and the presence of the
> > FI_RMA_PMEM (or FI_PMEM) flag in the info structure should imply
> > that the hardware is capable of handling the commit requests to
> > persistent memory and the application does not need to read the
> > event queue for commit events.
> >
> > *Change of flag definition
> >
> > *The FI_RMA_PMEM flag should be changed to FI_PMEM to indicate that
> > the provider is PMEM aware, and supports RMA/AMO/MSG operations to
> > and from persistent memory.
> > *There may be little value in supporting messaging interfaces, but
> > it is something that could be supported.
>
> Support for messaging requires additional definitions. Part of
> the discussion is figuring out the scope of what should be defined in
> the short term. As mentioned above, FI_FENCE | FI_COMMIT_COMPLETE can
> be used to commit message transfers. I can't think of a better
> alternative here. However, I'm not sure if the proposed IBTA and IETF
> specifications will result in hardware capable of supporting the
> FI_FENCE | FI_COMMIT_COMPLETE semantic. :/
>
>
> > *Addition of an event handler registration for handling event queue
> > entries within the provider context (See Definition:
> > fi_eq_event_handler)
> >
> > *Essentially, this becomes a registered callback for the target
> > application to handle specific event types. We can use this
> > mechanism with the target application to allow the provider to
> > handle events internally using a function provided by the
> > application. The function would contain the logic necessary to
> > handle the event
>
> Callbacks are to be avoided. They present difficult locking scenarios
> with severe restrictions on what the application can do from the
> callback, and present challenging object destruction situations.
> Those restrictions can be difficult for an application to enforce,
> since calls outside the app to other libraries may violate them.
>
>
> > *Specific to PMEM, a function handler would be used by the target
> > application to handle commits to persistent memory as they were
> > delivered, without requiring a fi_eq_read and some form of
> > acknowledgement around the commit action. With the handler, the
> > commit could be handled entirely by the function provided by the
> > application, and the return code from the application-provided
> > call-back would be sufficient for a software emulation in the
> > provider to produce the return message to the sender that the commit
> > transaction is fully complete. The use of a handler allows us to
> > make the commit transaction as light-weight, or heavy-weight, as
> > necessary.
> >
> > Definitions:
> >
> > fi_commit
> >
> > ssize_t fi_commit(struct fid_ep *ep,
> >                   const struct fi_rma_iov *iov,
> >                   size_t count,
> >                   fi_addr_t dest_addr,
> >                   uint64_t flags,
> >                   void *context);
> >
> > fi_eq_commit_entry
> >
> > struct fi_eq_commit_entry {
> >         fid_t fid;                    /* fid associated with request */
> >         const struct fi_rma_iov *iov; /* iovec of memory regions to be
> >                                          committed to persistent memory */
> >         size_t count;                 /* number of iovec/key entries */
> >         uint64_t flags;               /* operation-specific flags */
> > };
> >
> > fi_eq_event_handler
> >
> > typedef ssize_t (*fi_eq_event_handler_t)(struct fid_eq *eq,
> >                                          uint64_t event_type,
> >                                          void *event_data,
> >                                          uint64_t len,
> >                                          void *context);
> >
> > ssize_t fi_eq_register_handler(struct fid_eq *eq,
> >                                uint64_t event_type,
> >                                fi_eq_event_handler_t handler,
> >                                void *context);
> >
> > Use cases supported by this proposal:
> >
> > *As an application writer, I need to commit multiple previously-sent data
> > transfers to the persistence domain
>
> To be clear, the proposal only supports RMA writes, and maybe
> atomics, to the target memory. That is likely sufficient for now, but
> I'd like to ensure that we have a way to extend pmem support beyond
> the limited use cases being discussed.
>
>
> > *Previous functionality allows for a commit for every message, as
> > is the case for FI_COMMIT_COMPLETE, or the use of FI_COMMIT on a
> > per-transaction basis. The need in this use case is
> > performance-oriented: to allow a less strict delivery model to the
> > NIC for most messages, followed up with a 'flush' of the NIC to the
> > persistence domain. This allows most messages targeted to the
> > persistence domain to complete with a less strict delivery model,
> > and provides a mechanism to ensure that those data transfers are
> > eventually persisted.
>
> Unless the app has set FI_COMMIT_COMPLETE as the default completion
> model, it only applies to the operation on which it was set. The main
> gap I'm aware of with proposed specifications is support of a 'flush'
> type semantic.
>
>
> - Sean
>
>
> _______________________________________________
> ofiwg mailing list
> ofiwg at lists.openfabrics.org
> https://lists.openfabrics.org/mailman/listinfo/ofiwg
>