[libfabric-users] libfabric transaction ordering w.r.t. Chapel memory consistency model

Tue Feb 11 08:22:51 PST 2020

> ... how can I know when the result of a write is visible in
> the remote memory, so that I can retire the matching outstanding-write list entry?

> Are you wanting the initiator or the target to know that the data is visible?  For the target, verbs indicates that the data is visible when a completion is read at the target for an operation that followed the write, or if the completion is for the write itself (i.e. carries CQ data).

The initiator.  The requirement here is that a task (Chapel's instance of sequential execution) must observe the results of its own regular reads and writes to the same address to have occurred in execution order.  I.e., when a task writes to some location and then reads from the same location, the value it reads must be the one that was written.  Similarly, when a task reads from and then writes to a location, the value read must be what the location held before the write.  Or more succinctly, within a single task, regular reads and writes to the same location cannot be reordered.  (This is for data-race-free programs, so assume no other task is referencing this same location during this period.)
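
For illustration, here is a minimal sketch of the per-task outstanding-write list mentioned in the quoted question, covering only the read-after-write case.  It is plain C with no libfabric calls; the type and function names (pending_list_t, pending_add, pending_conflicts, pending_retire) and the fixed capacity are assumptions made for this example, not part of Chapel or libfabric.  How the initiator learns that a write is visible at the target, and therefore when pending_retire may be called, is exactly the open question in this thread and is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_OUTSTANDING 64  /* hypothetical per-task capacity */

/* One in-flight remote write issued by this task (hypothetical layout). */
typedef struct {
    int      node;    /* target node id */
    uint64_t addr;    /* remote start address */
    size_t   len;     /* length in bytes */
    bool     in_use;  /* slot currently holds an unretired write */
} pending_write_t;

typedef struct {
    pending_write_t w[MAX_OUTSTANDING];
} pending_list_t;

/* Record a newly issued write.  Returns the slot index, or -1 if the
 * list is full (the caller would then have to wait for some writes to
 * become visible before issuing more). */
static int pending_add(pending_list_t *pl, int node, uint64_t addr, size_t len)
{
    for (int i = 0; i < MAX_OUTSTANDING; i++) {
        if (!pl->w[i].in_use) {
            pl->w[i] = (pending_write_t){ node, addr, len, true };
            return i;
        }
    }
    return -1;
}

/* Retire an entry once the initiator knows the write is visible remotely. */
static void pending_retire(pending_list_t *pl, int slot)
{
    assert(slot >= 0 && slot < MAX_OUTSTANDING);
    pl->w[slot].in_use = false;
}

/* True if [addr, addr+len) on `node` overlaps any unretired write.  A
 * later read that conflicts must be ordered after that write by some
 * other means before it is issued. */
static bool pending_conflicts(const pending_list_t *pl, int node,
                              uint64_t addr, size_t len)
{
    for (int i = 0; i < MAX_OUTSTANDING; i++) {
        const pending_write_t *p = &pl->w[i];
        if (p->in_use && p->node == node &&
            p->addr < addr + len && addr < p->addr + p->len) {
            return true;
        }
    }
    return false;
}
```

A read would call pending_conflicts first and only take the slow path (wait, fence, or dummy read) on a match.  The write-after-read case would need a similar list of outstanding reads.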

> ... I also need to ensure that
> when a single task does an atomic op followed by a regular load or store, the effect of
> the atomic op on its target object is seen before the load or store references memory.

> ORDER_WAW orders both atomic updates and RMA write operations against each other.  ORDER_ATOMIC_WAW and ORDER_RMA_WAW allow specifying those separately.  It sounds like ORDER_WAW (etc.) is what you want.

Beyond saying that it doesn't support FI_FENCE (as discussed below), the fi_rxm man page also says that if FI_ATOMIC is specified in the hints capabilities, FI_ORDER_{RAR,RAW,WAR,WAW,SAR,SAW} support is disabled.  It also doesn't include FI_ATOMIC in the capabilities unless you specifically request it, which may well be because of this limitation.  So I'm pretty sure I'm going to be using processor atomics done via Active Messages for remote atomic ops with verbs;ofi_rxm.

Thanks for all the feedback, Sean!  Not all the answers make me happy from a performance point of view, but at least it doesn't sound like I missed any better ways of doing things than the ones I'd come up with.

greg

________________________________
From: Hefty, Sean <sean.hefty at intel.com>
Sent: Monday, February 10, 2020 4:38 PM
To: Titus, Greg <gregory.titus at hpe.com>; libfabric-users at lists.openfabrics.org <libfabric-users at lists.openfabrics.org>
Subject: RE: libfabric transaction ordering w.r.t. Chapel memory consistency model

> I'm implementing a libfabric-based multi-node communication module for the Chapel
> (https://chapel-lang.org/ ) runtime library.  I have some questions about how best to
> use libfabric's capabilities to implement Chapel's memory consistency model (MCM),
> while nevertheless maximizing performance.
>
> Chapel's MCM is based on sequential consistency for data-race-free programs as adopted
> by a number of other languages.  In its simplest form, the MCM says that atomic ops
> done by a task are seen to occur in program order with respect to both other atomic ops
> and sequences of regular loads and stores done by that task, while regular loads and
> stores are not ordered with respect to each other except that a task sees its own loads
> and stores to the same address to have occurred in program order.  It's basically the
> UPC "non-strict" MCM, if you're familiar with that.

I'm not familiar with non-strict MCM, and can't say that I fully understand the desired semantics being requested.

From the description, it sounds like you are wanting ORDER_RAW / WAR / and WAW, as you mention below.

> For the following, assume I'm using FI_TRANSMIT_COMPLETE for completion, either by
> default or explicitly.  This might be either because FI_DELIVERY_COMPLETE isn't
> available in some providers, or because transmit_complete performs better than
> delivery_complete.

Delivery complete requires a software based ack for most (all?) hardware.

> For regular loads and stores, including the same-address clause, it seems I could make
> libfabric match the Chapel MCM just by asserting
> FI_ORDER_RAW|FI_ORDER_WAR|FI_ORDER_WAW.  But that's overkill, because it will order

It might seem like overkill, but probably isn't that bad if you consider what it takes to support these.  You likely lose dynamic routing as a network feature.  The verbs and tcp providers will give you WAW and RAW anyway.

A more significant issue is that WAR ordering isn't supported natively by verbs hardware.  You may need to fence write operations that follow reads to the same memory location.

> transactions to all addresses, not just those that target the same address.  So I could
> also do something like maintaining a list of outstanding remote writes in each task and
> consulting that for address matches to see if a later read refers to an earlier write
> that is still in flight.  That's no problem - it's a common technique for improving
> performance.  But if I do that, how can I know when the result of a write is visible in
> the remote memory, so that I can retire the matching outstanding-write list entry?  I

Are you wanting the initiator or the target to know that the data is visible?  For the target, verbs indicates that the data is visible when a completion is read at the target for an operation that followed the write, or if the completion is for the write itself (i.e. carries CQ data).

I don't recall if libfabric defines this level of visibility, but I don't think it does.  That is likely a gap in the receive side completion semantics.

If you need the initiator to know this, the API option is to issue an operation with delivery_complete with a fence flag (also like you mention).  This is the best option we have today, but I'm looking at other possibilities here (in the context of persistent memory).

> believe I can force writes to complete remotely by setting up endpoints with
> FI_ORDER_RAW and then doing dummy reads from each target I'm interested in.  But that
> seems heavyweight because it forces the writes to complete and really, I only want to
> be informed when they complete, not force them to do so.  It looks like FI_FENCE could
> be used to solve this, but I'm not sure that's available to me because I think I need
> to work with the verbs;ofi_rxm provider and ofi_rxm doesn't support FI_FENCE.

Hmm... this sounds like a gap.  I don't know why rxm doesn't just pass the fence flag through.

> I have a similar issue with respect to ordering atomics.  I need to ensure that the
> effects on target objects of a sequence of atomic ops done by a single task are seen to
> occur in program order.  Would asserting FI_ORDER_ATOMIC_WAW on both initiating and
> target endpoints guarantee order for target object updates?  I also need to ensure that
> when a single task does an atomic op followed by a regular load or store, the effect of
> the atomic op on its target object is seen before the load or store references memory.

ORDER_WAW orders both atomic updates and RMA write operations against each other.  ORDER_ATOMIC_WAW and ORDER_RMA_WAW allow specifying those separately.  It sounds like ORDER_WAW (etc.) is what you want.

> The fi_atomic(3) man page only says that a completion isn't delivered at the originator
> until after the result of a fetching atomic op is available there, and a completion (if
> any) isn't delivered at the target until after the effect of an atomic op on its target
> object is visible there.  There doesn't seem to be a direct way to connect the change
> to the target object and the delivery of a completion event to the initiator.  So
> what's the best way for an originator to ensure that the target effect of an atomic op
> is visible before it continues?  Do I need to do something like requesting remote
> completions for atomic ops and have targets send messages back to initiators when they
> see such events?  That seems heavyweight.

If you want the initiator to know that the data is visible at the target, then delivery_complete is the semantic that you want.  However... rxm currently does not properly implement delivery_complete semantics.  But if/when it does, it will require the use of software acks above verbs devices.  Yes, it's heavyweight, but it's the only option I'm aware of for those devices.

> I also have to ensure that the effects of a sequence of regular loads and stores are
> visible before the effect on the target of a subsequent atomic op, but I believe for
> that I can use an extension of the solution for the same-address clause, and do a read
> from every target I've written to since the last atomic op, with FI_ORDER_RAW asserted.
>
> (Note that some of the above may be moot for ofi_rxm, since at least in v1.8 it is
> documented to disable support for some of the orderings if you ask for the FI_ATOMIC
> capability.  So for that one I may simply forego libfabric network atomics and use
> processor atomics via Active Messages, a capability that already exists in Chapel
> because of the need to be portable to less-capable networks.)

I think this is a result of implementing the atomics in software, but still using RMA hardware when available.
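
As an illustration of the approach quoted above (a read from every target written to since the last atomic op), the bookkeeping could look roughly like the following sketch.  It is plain C with no libfabric calls.  The names dirty_set_t, note_write, flush_before_atomic and the MAX_NODES bound are hypothetical, and the flush callback stands in for whatever the runtime actually does to force earlier writes to a node to become visible (for example a dummy read on an endpoint with FI_ORDER_RAW asserted).  count_flush is only a stand-in callback that counts calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 1024  /* hypothetical upper bound on node count */

/* Per-task record of which nodes have been written since the last atomic. */
typedef struct {
    bool dirty[MAX_NODES];
} dirty_set_t;

/* Hypothetical hook: make earlier writes to `node` visible, e.g. by
 * issuing a dummy read and waiting for its completion. */
typedef void (*flush_fn)(int node, void *ctx);

/* Call after issuing a regular remote write to `node`. */
static void note_write(dirty_set_t *ds, int node)
{
    assert(node >= 0 && node < MAX_NODES);
    ds->dirty[node] = true;
}

/* Call before issuing an atomic op: flush each written node once, clear
 * the set, and return how many nodes were flushed. */
static int flush_before_atomic(dirty_set_t *ds, flush_fn flush, void *ctx)
{
    int flushed = 0;
    for (int n = 0; n < MAX_NODES; n++) {
        if (ds->dirty[n]) {
            flush(n, ctx);
            ds->dirty[n] = false;
            flushed++;
        }
    }
    return flushed;
}

/* Stand-in flush callback for demonstration: counts invocations. */
static void count_flush(int node, void *ctx)
{
    (void)node;
    ++*(int *)ctx;
}
```

Repeated writes to the same node cost only one flush, and an atomic op that follows no writes costs none.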

- Sean