[ofiwg] Proposal for enhancement to support additional Persistent Memory use cases (ofiwg/libfabric#5874)
Douglas, Chet R
chet.r.douglas at intel.com
Tue May 5 13:33:01 PDT 2020
See comments below.
Also, I didn’t see any mention of memory registration attributes. I know it’s not something apps need from the library, but it’s something the RNIC needs from the app...
From: Hefty, Sean <sean.hefty at intel.com>
Sent: Tuesday, May 05, 2020 1:50 PM
To: Grun, Paul <paul.grun at hpe.com>; Douglas, Chet R <chet.r.douglas at intel.com>; Rupert Dance - SFI <rsdance at soft-forge.com>; Swaro, James E <james.swaro at hpe.com>; ofiwg at lists.openfabrics.org
Subject: RE: [ofiwg] Proposal for enhancement to support additional Persistent Memory use cases (ofiwg/libfabric#5874)
> I believe the current proposals do take into account the IBTA/IETF
> drafts. But in any case, none of this is going anywhere yet until
> it's been thoroughly discussed and reviewed, beginning at tomorrow's
> OFIWG meeting. As of now, you are first on tomorrow's agenda.
Based on the ofiwg call today, I added to GitHub a list of the low-level features that applications should be able to access, and of how well the current API maps to those features. It is copied below.
If we can ensure that we have the right feature list and agree on the current status of which features are or are not being met, it should be easier to design the right solution to cover the gaps.
There are four main low-level functions that the API needs to map to:
1. **8-byte atomic write ordered with RDMA writes**
OFI defines a more generic atomic write. Message ordering is controlled through fi_tx_attr::msg_order flags. Data ordering is controlled through fi_ep_attr::max_order_waw_size. The existing API should be sufficient.
Chet> How will the provider know which opcode to put on the wire if we use the same API?
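As a rough illustration of how an application might check that the two attributes named in item 1 are enough, here is a minimal C sketch. It does not call libfabric: `struct model_attrs`, `MODEL_ORDER_WAW`, and `atomic_write_ordered()` are hypothetical stand-ins that only mirror fi_tx_attr::msg_order and fi_ep_attr::max_order_waw_size, and the sketch assumes the data-ordering bound must cover both the preceding RDMA writes and the 8-byte atomic.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical local model; NOT the libfabric definitions or flag values. */
#define MODEL_ORDER_WAW (1u << 0) /* stands in for the WAW bit in msg_order */

struct model_attrs {
    uint64_t msg_order;          /* models fi_tx_attr::msg_order */
    size_t   max_order_waw_size; /* models fi_ep_attr::max_order_waw_size */
};

/* Returns true if an 8-byte atomic write issued after RDMA writes of at most
 * `largest_write` bytes would be ordered behind them. Assumption: both WAW
 * message ordering and WAW data ordering are required, and the data-ordering
 * size bound must cover every operation involved. */
static bool atomic_write_ordered(const struct model_attrs *a, size_t largest_write)
{
    const size_t atomic_size = 8;
    size_t needed = largest_write > atomic_size ? largest_write : atomic_size;

    if (!(a->msg_order & MODEL_ORDER_WAW))
        return false;                       /* no write-after-write message ordering */
    return a->max_order_waw_size >= needed; /* data ordering covers all writes */
}
```

A real application would read these values from the fi_info returned by fi_getinfo() rather than fill them in by hand. The sketch says nothing about which opcode goes on the wire, which remains the open question raised above.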
2. **flush data for persistency**
The low-level flush operation ensures previous RDMA and atomic write operations to a given target region are persistent prior to completing. The target region may be accessible through multiple endpoints and NIC ports. Also, low-level transports require write-after-write message and data ordering, which is assumed by the flush operation.
OFI defines FI_COMMIT_COMPLETE for persistent completion semantics. This provides limited support, handling only one mapping: an RMA write followed by a matching flush. A more generic mechanism needs to be defined, one that allows a less strict completion on the RMA writes, followed by a separate persistence command. This is possible today through the FI_FENCE flag, but fencing could stall the message stream.
Chet> Does the current implementation assume there is a single write with a single flush that has the exact same rkey and regions? We obviously need to assume many writes before a flush, and that the flush may cover only a portion of the written region.
Chet> What about the GO/P PLT placement attributes of the flush command? We will need to expose those as well.
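The flush semantics in item 2, including the case of many writes followed by a flush of only part of the written region, can be modeled with a small per-byte state machine. This is a hypothetical C sketch, not the libfabric or IBTA definition: `model_write()`, `model_flush()`, and `model_is_persistent()` are illustrative names. It assumes write-after-write ordering already holds (so the writes have landed when the flush runs) and ignores the multiple-endpoint and multiple-port cases.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-byte model of a registered target region. */
enum byte_state { UNWRITTEN = 0, WRITTEN = 1, PERSISTENT = 2 };

#define REGION_SIZE 256

struct target_region {
    unsigned char state[REGION_SIZE];
};

static void region_init(struct target_region *r)
{
    memset(r->state, UNWRITTEN, sizeof r->state);
}

/* Bounds check shared by all operations. */
static bool in_bounds(size_t off, size_t len)
{
    return off <= REGION_SIZE && len <= REGION_SIZE - off;
}

/* An RDMA write lands data: visible, but not yet persistent. */
static bool model_write(struct target_region *r, size_t off, size_t len)
{
    if (!in_bounds(off, len))
        return false;
    memset(r->state + off, WRITTEN, len);
    return true;
}

/* A flush makes every previously written byte in [off, off + len) persistent.
 * Bytes outside the range, or never written, are left untouched.
 * Returns the number of bytes this call made persistent. */
static size_t model_flush(struct target_region *r, size_t off, size_t len)
{
    size_t n = 0;
    if (!in_bounds(off, len))
        return 0;
    for (size_t i = off; i < off + len; i++) {
        if (r->state[i] == WRITTEN) {
            r->state[i] = PERSISTENT;
            n++;
        }
    }
    return n;
}

/* True only if every byte in [off, off + len) is persistent. */
static bool model_is_persistent(const struct target_region *r, size_t off, size_t len)
{
    if (!in_bounds(off, len))
        return false;
    for (size_t i = off; i < off + len; i++)
        if (r->state[i] != PERSISTENT)
            return false;
    return true;
}
```

For example, two 64-byte writes followed by a flush of the first 96 bytes leave the last 32 written bytes visible but not yet persistent until a second flush covers them.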
3. **flush data for global visibility**
This is similar to 2, with application and fabric visibility replacing persistency.
OFI defines FI_DELIVERY_COMPLETE as a visibility completion semantic. It has the same limitations as described above.
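One way to compare the completion semantics behind items 2 and 3 is to rank libfabric's per-operation completion levels from weakest to strongest. The enum below is a local model only: its names echo FI_INJECT_COMPLETE through FI_COMMIT_COMPLETE, but it does not use the real flag values, and `required_level()` and `satisfies()` are hypothetical helpers.

```c
#include <stdbool.h>

/* Local model of completion levels, weakest to strongest.
 * NOT the libfabric flag values; only the relative order matters here. */
enum completion_level {
    LVL_INJECT_COMPLETE,   /* source buffer may be reused */
    LVL_TRANSMIT_COMPLETE, /* transmitted; not necessarily visible at target */
    LVL_DELIVERY_COMPLETE, /* results visible at the target */
    LVL_COMMIT_COMPLETE    /* results persistent at the target */
};

enum flush_goal { GOAL_GLOBAL_VISIBILITY, GOAL_PERSISTENCE };

/* Weakest completion level that meets a goal in this model. */
static enum completion_level required_level(enum flush_goal g)
{
    return g == GOAL_PERSISTENCE ? LVL_COMMIT_COMPLETE : LVL_DELIVERY_COMPLETE;
}

/* A stronger level always implies the weaker ones. */
static bool satisfies(enum completion_level have, enum flush_goal g)
{
    return have >= required_level(g);
}
```

Because these levels attach to each individual operation, the model also reflects the limitation described above: there is no separate flush step that upgrades earlier, weaker completions after the fact.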
4. **Data verify**
There is no equivalent existing functionality, but it is aligned with discussions around SmartNIC and FPGA support, which involve defining generic offload functionality.
Chet> Sounds like a good fit
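For item 4, here is a minimal sketch of what target-side verification could look like: digest the target region and compare the result with a value supplied by the initiator. The digest algorithm is not specified above; CRC-32 (IEEE polynomial) is used here purely as an assumption, and `model_verify()` is a hypothetical name.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
 * Chosen only for illustration; a real verify may use another digest. */
static uint32_t crc32_ieee(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int b = 0; b < 8; b++)
            /* XOR in the polynomial only when the low bit is set */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Hypothetical target-side verify: digest the region and compare it
 * with the value the initiator supplied. */
static bool model_verify(const uint8_t *region, size_t len, uint32_t expected)
{
    return crc32_ieee(region, len) == expected;
}
```

Doing the comparison at the target means only the digest and a pass/fail result cross the fabric, rather than the region itself, which is why this fits the generic offload discussion.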