[ofiwg] Proposal for enhancement to support additional Persistent Memory use cases (ofiwg/libfabric#5874)

Hefty, Sean sean.hefty at intel.com
Tue May 5 14:13:45 PDT 2020

> -Also, I didn’t see any mention of memory registration attributes?  I know it's not
> something apps need from the library, but it's something the RNIC needs from the app...

This exists today, so I overlooked including it.  It's less a feature being exposed than a restriction that providers impose to make this work well.

> There are 4 main lower-level functions that need to be mapped to:
> 1. **8-byte atomic write ordered with RDMA writes** OFI defines a more generic atomic
> write.  Message ordering is controlled through fi_tx_attr::msg_order flags.  Data
> ordering is controlled through fi_ep_attr::max_order_waw_size.  The existing API should
> be sufficient.
> Chet> How will the provider know which opcode to put on the wire if we use the same
> API?

For verbs, this isn't an issue because there's not an alternative write atomic operation.

For providers with multiple protocols available, the full set of attributes used to configure the endpoint needs to guide the selection.  For example, if the application requires write-after-write message order, that's indicated through a msg_order flag.  If it needs all write data placed in order, max_order_waw_size conveys that.

We have places in libfabric today where the protocol changes based on various attributes or operational flags.
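
To make the attribute-driven selection concrete, here is a minimal sketch of how a multi-protocol provider might pick a wire protocol from the endpoint configuration.  Everything in it is hypothetical: the `model_*` types, the `MODEL_ORDER_WAW` bit, and the three protocol names are stand-ins invented for illustration, not libfabric definitions or any provider's actual logic.  It assumes that a max_order_waw_size of 0 means no data ordering was requested.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a write-after-write message order flag.
 * This is NOT the libfabric definition. */
#define MODEL_ORDER_WAW (1u << 0)

/* Hypothetical subset of the endpoint configuration named in the thread. */
struct model_ep_config {
    uint64_t msg_order;          /* models fi_tx_attr::msg_order */
    size_t   max_order_waw_size; /* models fi_ep_attr::max_order_waw_size */
};

/* Hypothetical wire protocols a multi-protocol provider might offer. */
enum model_protocol {
    MODEL_PROTO_UNORDERED,    /* no ordering requested */
    MODEL_PROTO_MSG_ORDERED,  /* writes processed in order, placement unordered */
    MODEL_PROTO_DATA_ORDERED  /* writes processed and data placed in order */
};

/* Choose a protocol for one write of write_len bytes, using only the
 * attributes the application supplied when configuring the endpoint. */
enum model_protocol
model_select_protocol(const struct model_ep_config *cfg, size_t write_len)
{
    /* Without write-after-write message order, nothing constrains us. */
    if (!(cfg->msg_order & MODEL_ORDER_WAW))
        return MODEL_PROTO_UNORDERED;

    /* Data ordering applies only to writes within the requested size. */
    if (cfg->max_order_waw_size > 0 && write_len <= cfg->max_order_waw_size)
        return MODEL_PROTO_DATA_ORDERED;

    return MODEL_PROTO_MSG_ORDERED;
}
```

Under these assumptions, an 8-byte write on an endpoint configured with write-after-write order and a max_order_waw_size of 8 would take the data-ordered path, while a larger write on the same endpoint would only get message ordering.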

> 2. **flush data for persistency**
> The low-level flush operation ensures previous RDMA and atomic write operations to a
> given target region are persistent prior to completing.  The target region may be
> accessible through multiple endpoints and NIC ports.  Also, low-level transports
> require write after write message and data ordering, which is assumed by the flush
> operation.
> OFI defines FI_COMMIT_COMPLETE for persistent completion semantics.  This provides
> limited support, handling only the following mapping: RMA write followed by a matching
> flush.  A more generic mechanism needs to be defined, which would allow for a less
> strict completion on the RMA writes, with the persistent command following.  This is
> possible today through the FI_FENCE flag, but that could result in stalls in the
> messaging.
> Chet> Does the current implementation assume there is a single write with a single
> flush that has the exact same rkey and regions?  Obviously need to assume many writes
> before a flush and the flush may be for a portion of the written region.

The current implementation would only work for a single write followed by a single flush to the exact same region.  I called this out to highlight the gap, so I wouldn't focus on it beyond that.  The GitHub comment wasn't trying to propose a solution.

> Chet> What about the GO/P PLT placement attributes of the flush command?  We will need
> to expose those as well.

I listed the flush operation for visibility as a separate feature, just below.

> 3. **flush data for global visibility**
> This is similar to 2, with application and fabric visibility replacing persistency.
> OFI defines FI_DELIVERY_COMPLETE as a visibility completion semantic.  This has similar
> limits as mentioned above.
> 4. **Data verify**
> There is no equivalent existing functionality, but it is aligned with discussions
> around SmartNIC and FPGA support, which defines generic offload functionality.
> Chet>  Sounds like a good fit

- Sean
