[ofiwg] libfabric network atomic operations, processor atomic ops, and coherence
Hefty, Sean
sean.hefty at intel.com
Tue Feb 26 11:31:36 PST 2019
Copying ofiwg
> I could get a fabric-specific version of this question answered within Cray
> Inc. But I thought there might be interest in the general version, so I’m
> asking here after doing a quick check of the last 2 years of the archives and
> not seeing it discussed.
>
> A network that implements network atomic operations may do so such that those
> ops are not coherent with processor atomic ops (including even simple READs)
> to the same memory locations. This is true of Cray Inc.’s Aries interconnect,
> for example. On an Aries-based Cray XC system, for a given memory location
> one has to either do all atomic ops using the NIC, or do all of them using
> the processor, or use both the NIC and the processor but take additional
> (well-defined) actions to ensure coherence between the two. What can a
> programmer using libfabric atomics assume about coherence between network and
> processor atomic operations when running on such a network? Is it incumbent
> upon providers supporting such networks to create the appearance of coherence?
> Or is it incumbent upon the programmer to know about possible non-coherence
> and deal with it, and if so, does libfabric give any help with this? I’ve
> looked through the man pages and don’t see any discussion of atomic coherence,
> either in fi_atomic(3) or fi_gni(7).
There is no guarantee that NIC/network-based atomics will be coherent with CPU-based atomics, that they will be coherent between NICs, or that the final result will even be atomic. The intent is to allow for a variety of implementations, including the possibility that a NIC may temporarily cache the results of an atomic operation.
The fi_atomic man page describes when the results of a network-based atomic operation will be visible to the CPU. That discussion can be expanded to state explicitly that atomics performed by different NICs or by the CPU on the same memory region may not be atomic with respect to each other.
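Given that lack of guarantees, one conservative approach for an application is to bind each shared location to a single agent and route every atomic on it through that agent. The following is a minimal sketch of that idea, not libfabric code. The names shared_counter, nic_fetch_add, cpu_fetch_add, and shared_fetch_add are hypothetical, and the NIC path is simulated with a processor atomic so the sketch compiles without a fabric. In a real comm layer that path would presumably issue a network atomic (for example with fi_fetch_atomic, even when the target is local) and wait for its completion.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Which agent performs every atomic on a given location.
 * Fixed when the location is created and never mixed afterwards. */
enum atomic_path { PATH_CPU, PATH_NIC };

/* Hypothetical wrapper that ties the storage to its chosen agent. */
struct shared_counter {
    enum atomic_path path;
    _Atomic uint64_t value;
};

/* Demonstration-only counters showing which path was taken. */
static unsigned nic_ops;
static unsigned cpu_ops;

/* ASSUMPTION: stand-in for a network fetch-add. It is simulated with a
 * processor atomic here; real code would issue the operation through
 * the NIC and wait for it to complete before returning. */
static uint64_t nic_fetch_add(_Atomic uint64_t *target, uint64_t operand)
{
    nic_ops++;
    return atomic_fetch_add(target, operand);
}

/* Processor fetch-add, valid only for locations the NIC never updates. */
static uint64_t cpu_fetch_add(_Atomic uint64_t *target, uint64_t operand)
{
    cpu_ops++;
    return atomic_fetch_add(target, operand);
}

/* Single entry point: callers never pick the path per operation, so a
 * location cannot see a mix of NIC and CPU atomics by accident. */
static uint64_t shared_fetch_add(struct shared_counter *c, uint64_t operand)
{
    assert(c->path == PATH_CPU || c->path == PATH_NIC);
    if (c->path == PATH_NIC)
        return nic_fetch_add(&c->value, operand);
    return cpu_fetch_add(&c->value, operand);
}
```

A caller would initialize a counter once, for example `struct shared_counter c = { .path = PATH_NIC, .value = 0 };`, and then use only `shared_fetch_add(&c, n)` on it. Reads of a NIC-owned location would need the same treatment, since even simple processor reads may not be coherent with network atomics.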
> I’m interested in this because I’m currently developing a libfabric-based
> implementation of the multi-node communication layer of the runtime library
> for Chapel (https://chapel-lang.org/). Chapel has atomics, and our existing
> uGNI-native comm layer has to take into account the processor/network atomic
> non-coherence on Gemini- and Aries-based Cray XE and XC systems. We’d like
> the new libfabric comm layer to work with as wide a variety of providers as
> possible, so we’d like to take the most general approach to the coherence
> problem we can.
>
> Thanks!
>
> Greg Titus
> Chapel core team
- Sean