[libfabric-users] libfabric transaction ordering w.r.t. Chapel memory consistency model
gregory.titus at hpe.com
Mon Feb 10 14:45:48 PST 2020
I'm implementing a libfabric-based multi-node communication module for the Chapel (https://chapel-lang.org/) runtime library. I have some questions about how best to use libfabric's capabilities to implement Chapel's memory consistency model (MCM), while nevertheless maximizing performance.
Chapel's MCM is based on sequential consistency for data-race-free programs, as adopted by a number of other languages. In its simplest form, the MCM says that atomic ops done by a task are seen to occur in program order with respect to both other atomic ops and sequences of regular loads and stores done by that task, while regular loads and stores are not ordered with respect to each other, except that a task sees its own loads and stores to the same address as having occurred in program order. It's basically the UPC "non-strict" MCM, if you're familiar with that.
For the following, assume I'm using FI_TRANSMIT_COMPLETE for completion, either by default or explicitly. This might be either because FI_DELIVERY_COMPLETE isn't available in some providers, or because FI_TRANSMIT_COMPLETE performs better than FI_DELIVERY_COMPLETE.
For regular loads and stores, including the same-address clause, it seems I could make libfabric match the Chapel MCM just by asserting FI_ORDER_RAW|FI_ORDER_WAR|FI_ORDER_WAW. But that's overkill, because it will order transactions to all addresses, not just those that target the same address. So I could also do something like maintaining a list of outstanding remote writes in each task and consulting that for address matches to see if a later read refers to an earlier write that is still in flight. That's no problem - it's a common technique for improving performance. But if I do that, how can I know when the result of a write is visible in the remote memory, so that I can retire the matching outstanding-write list entry? I believe I can force writes to complete remotely by setting up endpoints with FI_ORDER_RAW and then doing dummy reads from each target I'm interested in. But that seems heavyweight because it forces the writes to complete and really, I only want to be informed when they complete, not force them to do so. It looks like FI_FENCE could be used to solve this, but I'm not sure that's available to me because I think I need to work with the verbs;ofi_rxm provider and ofi_rxm doesn't support FI_FENCE.
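A minimal sketch of the per-task outstanding-write list described above, written as plain C with no libfabric calls. All names here (pending_list_t, pending_add, pending_overlaps, pending_retire_node, MAX_PENDING) are hypothetical and are not part of the Chapel runtime or of libfabric. It assumes some other mechanism tells the task when writes to a given node have become visible, which is exactly the open question; the sketch only shows the bookkeeping around that event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 64  /* hypothetical per-task capacity */

/* One remote write that has been issued but is not yet known to be visible. */
typedef struct {
    int       node;  /* target node id */
    uintptr_t addr;  /* remote start address */
    size_t    len;   /* number of bytes written */
} pending_write_t;

typedef struct {
    pending_write_t w[MAX_PENDING];
    size_t          count;
} pending_list_t;

/* Record a write. Returns false if the list is full, in which case the
 * caller would have to wait for earlier writes before issuing more. */
bool pending_add(pending_list_t *pl, int node, uintptr_t addr, size_t len)
{
    assert(pl != NULL);
    if (pl->count == MAX_PENDING)
        return false;
    pl->w[pl->count++] = (pending_write_t){ node, addr, len };
    return true;
}

/* True if [addr, addr+len) on 'node' overlaps any in-flight write, meaning
 * a read of that range must not be issued until the write is visible. */
bool pending_overlaps(const pending_list_t *pl, int node,
                      uintptr_t addr, size_t len)
{
    for (size_t i = 0; i < pl->count; i++) {
        const pending_write_t *p = &pl->w[i];
        if (p->node == node &&
            addr < p->addr + p->len && p->addr < addr + len)
            return true;
    }
    return false;
}

/* Retire every entry for 'node' once its writes are known to be visible.
 * Returns the number of entries removed. */
size_t pending_retire_node(pending_list_t *pl, int node)
{
    size_t kept = 0;
    for (size_t i = 0; i < pl->count; i++)
        if (pl->w[i].node != node)
            pl->w[kept++] = pl->w[i];
    size_t retired = pl->count - kept;
    pl->count = kept;
    return retired;
}
```

In this sketch a read first calls pending_overlaps(); only on a match does the task need to pay for whatever visibility mechanism is chosen, after which pending_retire_node() clears the entries for that target.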
I have a similar issue with respect to ordering atomics. I need to ensure that the effects on target objects of a sequence of atomic ops done by a single task are seen to occur in program order. Would asserting FI_ORDER_ATOMIC_WAW on both initiating and target endpoints guarantee order for target object updates? I also need to ensure that when a single task does an atomic op followed by a regular load or store, the effect of the atomic op on its target object is seen before the load or store references memory. The fi_atomic(3) man page only says that a completion isn't delivered at the originator until after the result of a fetching atomic op is available there, and a completion (if any) isn't delivered at the target until after the effect of an atomic op on its target object is visible there. There doesn't seem to be a direct way to connect the change to the target object with the delivery of a completion event to the initiator. So what's the best way for an originator to ensure that the target effect of an atomic op is visible before it continues? Do I need to do something like requesting remote completions for atomic ops and have targets send messages back to initiators when they see such events? That seems heavyweight.
I also have to ensure that the effects of a sequence of regular loads and stores are visible before the effect on the target of a subsequent atomic op, but I believe for that I can use an extension of the solution for the same-address clause, and do a read from every target I've written to since the last atomic op, with FI_ORDER_RAW asserted.
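A rough sketch of the bookkeeping this would need: a per-task set of targets written to since the last atomic op, drained just before the next atomic op is issued. Everything named here (dirty_targets_t, dirty_mark, dirty_flush_all, flush_fn, MAX_NODES, record_flush) is hypothetical. The flush callback stands in for whatever forces visibility on one target, for example a small read on an endpoint with FI_ORDER_RAW asserted followed by waiting for its completion; record_flush is only a stand-in that logs calls so the sketch runs without a network.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_NODES 1024  /* hypothetical upper bound on node count */

/* One bit per node: set when this task has written to that node since
 * its last atomic op. */
typedef struct {
    uint64_t bits[MAX_NODES / 64];
} dirty_targets_t;

/* Hypothetical hook: make earlier writes to 'node' visible, e.g. by an
 * ordered dummy read, and return once that has completed. */
typedef void (*flush_fn)(int node, void *ctx);

/* Call after issuing a regular store to 'node'. */
void dirty_mark(dirty_targets_t *d, int node)
{
    assert(node >= 0 && node < MAX_NODES);
    d->bits[node / 64] |= UINT64_C(1) << (node % 64);
}

bool dirty_test(const dirty_targets_t *d, int node)
{
    assert(node >= 0 && node < MAX_NODES);
    return (d->bits[node / 64] >> (node % 64)) & 1;
}

/* Call before issuing an atomic op: flush each dirty node exactly once,
 * then clear the set. Returns the number of nodes flushed. */
size_t dirty_flush_all(dirty_targets_t *d, flush_fn flush, void *ctx)
{
    size_t flushed = 0;
    for (int node = 0; node < MAX_NODES; node++) {
        if (dirty_test(d, node)) {
            flush(node, ctx);
            flushed++;
        }
    }
    memset(d, 0, sizeof *d);
    return flushed;
}

/* Stand-in flush hook for experimentation: records calls, does no I/O. */
typedef struct {
    int calls;
    int last_node;
} flush_log_t;

void record_flush(int node, void *ctx)
{
    flush_log_t *log = ctx;
    log->calls++;
    log->last_node = node;
}
```

The bitmap keeps the cost of repeated stores to the same target at one flush per atomic op rather than one per store; whether that flush is cheap enough is the same question raised above for the same-address case.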
(Note that some of the above may be moot for ofi_rxm, since at least in v1.8 it is documented to disable support for some of the orderings if you ask for the FI_ATOMIC capability. So for that one I may simply forgo libfabric network atomics and use processor atomics via Active Messages, a capability that already exists in Chapel because of the need to be portable to less-capable networks.)
Thanks for reading this far if you did, and thanks for any insights you can provide!