[ofiwg] Proposal for enhancement to support additional Persistent Memory use cases (ofiwg/libfabric#5874)

Gromadzki, Tomasz tomasz.gromadzki at intel.com
Wed May 6 02:42:48 PDT 2020

I added a couple of comments/questions below, but first a few general ones.

There was a question about examples of application use cases.
They are essential and already included in the new verbs specification.
For instance, the message flow that Chet uses in his presentation (write-flush-(verify)-atomic_write-flush) is taken directly from the database domain. It is designed for OLTP and other log updates performed in a transactional way.

But there is one more perspective that we should consider when the new API is designed.
Persistent memory is a new type of technology, and we do not know all of its use cases yet. But we have already collected some knowledge about how to use it properly, and that knowledge is also part of the new verbs definition.
For instance, we have observed that writing to remote persistent memory followed by a read operation to flush the data from the RNIC to PMem performs poorly and should be avoided by applications. That is why there is a separate flush verb instead of an additional flag added to every write request.
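
As a rough illustration, the separate-flush idea maps onto today's libfabric API roughly as follows. This is an uncompiled sketch, not the proposed implementation: endpoint and memory-registration setup are elided, the function name is ours, and whether an FI_FENCE | FI_COMMIT_COMPLETE write is an adequate stand-in for a dedicated flush verb is exactly what is under discussion.

```c
/* Uncompiled sketch: approximating "write ... flush" with the existing
 * libfabric API.  `ep` and `msg` (iov, rma_iov, desc) are assumed to be
 * set up elsewhere; error handling is minimal. */
#include <rdma/fabric.h>
#include <rdma/fi_rma.h>

static ssize_t log_append_and_flush(struct fid_ep *ep,
                                    const struct fi_msg_rma *msg)
{
    /* The log records themselves: plain writes, with no per-operation
     * completion semantics requested. */
    ssize_t ret = fi_writemsg(ep, msg, 0);
    if (ret)
        return ret;

    /* The "flush" as a separate operation: FI_FENCE orders it behind
     * the prior writes, and FI_COMMIT_COMPLETE asks the provider to
     * complete it only once the data is persistent - instead of
     * issuing a read after every write. */
    return fi_writemsg(ep, msg, FI_FENCE | FI_COMMIT_COMPLETE);
}
```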

> -----Original Message-----
> From: ofiwg <ofiwg-bounces at lists.openfabrics.org> On Behalf Of Hefty,
> Sean
> Sent: Tuesday, May 5, 2020 11:14 PM
> To: Douglas, Chet R <chet.r.douglas at intel.com>; Grun, Paul
> <paul.grun at hpe.com>; Rupert Dance - SFI <rsdance at soft-forge.com>;
> Swaro, James E <james.swaro at hpe.com>; ofiwg at lists.openfabrics.org
> Subject: Re: [ofiwg] Proposal for enhancement to support additional
> Persistent Memory use cases (ofiwg/libfabric#5874)
> > -Also, I didn't see any mention of memory registration attributes?  I
> > know its not something apps need from the library, but its something the
> RNIC needs from the app...
> This is there today, so I overlooked including it.  But this isn't really a feature
> that's being exposed, but a restriction that providers have to make this work
> well.
> > There are 4 main lower-level functions that need to be mapped to:
> >
> > 1. **8-byte atomic write ordered with RDMA writes** OFI defines a more
> > generic atomic write.  Message ordering is controlled through
> > fi_tx_attr::msg_order flags.  Data ordering is controlled through
> > fi_ep_attr::max_order_waw_size.  The existing API should be sufficient.
> >
> > Chet> How will the provider know which opcode to put on the wire if we
> > Chet> use the same
> > API?
> For verbs, this isn't an issue because there's not an alternative write atomic
> operation.
> For providers with multiple protocols available, the full set of attributes used
> to configure the endpoint needs to guide the selection.  For example, if the
> application requires write-after-write message order, that's indicated
> through a msg_order flag.  If they need all write data placed in order,
> max_order_waw_size conveys that.

The problem I see here is the additional logic implemented in the new RDMA atomic write: its ordering rules and memory-alignment requirements.
Atomic write has different ordering rules than the existing atomics and a normal write - it waits for a previously requested flush, yet it does not stall a normal write or the other atomics. Additionally, an atomic write to an unaligned address will be rejected by the verbs provider.

I know that we could define a combination of flags that reflects this type of behavior, but it would be hard for a user to apply correctly and could become a source of hard-to-detect errors.
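
To make the alignment rule concrete, here is a minimal sketch of the kind of pre-check a provider (or a defensive application) could perform. The function name is hypothetical; the 8-byte width follows from the atomic write definition discussed above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pre-check mirroring the provider-side rule described
 * above: the new atomic write is 8 bytes wide, and a request that
 * targets an unaligned remote address is rejected. */
static bool atomic_write_addr_ok(uint64_t remote_addr)
{
    return (remote_addr & 7u) == 0;   /* must be 8-byte aligned */
}
```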

And one more question:
The present libfabric atomic write is mapped to an ibverbs write with the fence flag (please correct me if I misunderstand the current implementation).
How do we explain to the end user that this behavior will change?
It is something that will go unnoticed, as there will not be any API changes.

> We have places in libfabric today where the protocol changes based on
> various attributes or operational flags.
> > 2. **flush data for persistency**
> > The low-level flush operation ensures previous RDMA and atomic write
> > operations to a given target region are persistent prior to
> > completing.  The target region may be accessible through multiple
> > endpoints and NIC ports.  Also, low-level transports require write
> > after write message and data ordering, which is assumed by the flush
> operation.
> > OFI defines FI_COMMIT_COMPLETE for persistent completion semantics.
> > This provides limited support, handling only the following mapping:
> > RMA write followed by a matching flush.  A more generic mechanism
> > needs to be defined, which would allow for a less strict completion on
> > the RMA writes, with the persistent command following.  This is
> > possible today through the FI_FENCE flag, but that could result in stalls in
> the messaging.
> >
> > Chet> Does the current implementation assume there is a single write
> > Chet> with a single
> > flush that has the exact same rkey and regions?  Obviously need to
> > assume many writes before a flush and the flush may be for a portion of
> the written region.
> The current implementation would only work for a single write followed by a
> single flush to the exact same region.  This is being called out to highlight the
> gap, so I wouldn't focus on it other than for that purpose.  This github
> comment wasn't trying to propose a solution.

One write followed by one flush is the most inefficient way to access remote memory - this is one of the most important things we have learned while implementing different algorithms for remote persistent memory access.

We have also noticed that in the majority of cases, the software does not know whether it is executing the last write in a sequence.
Take the DB-engine example again: a transaction log is written record by record to remote pmem, but the decision to commit the transaction may not be related in any way to the last write.

And one more workflow, taken from an existing database solution.
It is based on a sequence of send-read (flush) requests:
a send delivers data to PMem (in order), and a read is used to ensure that the previous send has completed into the persistence domain.
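
For reference, the read half of that workaround looks roughly like this with today's verbs API. This is an uncompiled sketch: QP, SGE, and remote address/rkey setup are elided, and the assumption that a read completion implies the preceding sends landed in the persistence domain is platform-dependent, not something verbs guarantees.

```c
/* Uncompiled sketch of the send + read "flush" workaround. */
#include <infiniband/verbs.h>

static int flush_after_sends(struct ibv_qp *qp, struct ibv_sge *sge)
{
    struct ibv_send_wr rd = {0}, *bad_wr;

    /* An RDMA read on the same QP is ordered behind the prior sends;
     * the database treats its completion as proof that the earlier
     * sends have reached the target's persistence domain. */
    rd.opcode     = IBV_WR_RDMA_READ;
    rd.sg_list    = sge;
    rd.num_sge    = 1;
    rd.send_flags = IBV_SEND_SIGNALED;
    /* rd.wr.rdma.remote_addr and rd.wr.rdma.rkey are assumed to be
     * filled in by the caller. */
    return ibv_post_send(qp, &rd, &bad_wr);
}
```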

How many providers do we have that will support persistent memory natively?
Shall we look at these solutions and try to build the libfabric API for RPMEM in a way that is much closer to what we can deliver with existing technologies?

> > Chet> What about the GO/P PLT placement attributes of the flush
> > Chet> command?  We will need
> > to expose those as well.
> I listed flush operation for visibility purposes as a separate feature, just
> below.
> > 3. **flush data for global visibility** This is similar to 2, with
> > application and fabric visibility replacing persistency.
> > OFI defines FI_DELIVERY_COMPLETE as a visibility completion semantic.
> > This has similar limits as mentioned above.
> >
> > 4. **Data verify**
> > There is no equivalent existing functionality, but it is aligned with
> > discussions around SmartNIC and FPGA support, which defines generic
> offload functionality.
> >
> > Chet>  Sounds like a good fit

Verify is critical for workloads that need to ensure the consistency of remote persistent data.
We do not need any SmartNIC to support them at the first stage.
Verify could be implemented as a read into an internal driver buffer, where a CRC is calculated and returned to the end user.
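
A minimal sketch of that checksum step follows. The spec does not fix an algorithm; plain CRC-32 (the IEEE 802.3 polynomial, computed here bitwise over the driver's read buffer) is our assumption for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* CRC-32 (reflected IEEE 802.3 polynomial 0xEDB88320), bitwise
 * variant: the checksum the driver could compute over data read back
 * from remote PMem and hand to the end user for verification. */
static uint32_t crc32_compute(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1u));
    }
    return ~crc;
}
```

The standard check value for this CRC variant is 0xCBF43926 for the input "123456789".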


Intel Technology Poland sp. z o.o.

ul. Słowackiego 173 | 80-298 Gdańsk | District Court Gdańsk North | 7th Commercial Division of the National Court Register - KRS 101882 | NIP 957-07-52-316 | Share capital: 200,000 PLN.

This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). If you are not the intended recipient, please contact the sender and delete all copies; any review or distribution by others is strictly prohibited.
