[ofiwg] the problems with atomics

Jeff Hammond jeff.science at gmail.com
Mon Jul 13 13:22:12 PDT 2015


Is that sufficient?  If the result of sizeof() matches but the bitwise
representation does not, will OFI still work?
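
To make the question concrete, here is a minimal sketch (assuming a C11
compiler) of why sizeof() alone cannot identify the format: a 16-byte long
double can be x87 80-bit extended with padding, IBM double-double, or IEEE
binary128. The helper names long_double_format and describe_long_double are
hypothetical and are not part of libfabric; the mantissa widths are the
standard LDBL_MANT_DIG values for those formats.

```c
#include <float.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper, not a libfabric API: name the long double format
 * from its mantissa width in bits, which sizeof() cannot distinguish. */
static const char *long_double_format(int mant_dig)
{
    switch (mant_dig) {
    case 53:  return "binary64 (same as double)";
    case 64:  return "x87 80-bit extended";
    case 106: return "IBM double-double";
    case 113: return "IEEE binary128";
    default:  return "unknown";
    }
}

/* Hypothetical helper: format the local compiler's view of long double
 * (storage size, alignment, mantissa bits, format name) into buf.
 * Returns the snprintf() result. */
static int describe_long_double(char *buf, size_t len)
{
    return snprintf(buf, len, "sizeof=%zu align=%zu mant_dig=%d format=%s",
                    sizeof(long double), (size_t)_Alignof(long double),
                    LDBL_MANT_DIG, long_double_format(LDBL_MANT_DIG));
}
```

Two peers would have to agree on all of these fields, not just the first,
before a remote atomic on long double could be expected to make sense.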

Jeff

On Mon, Jul 13, 2015 at 1:20 PM, Sur, Sayantan <sayantan.sur at intel.com> wrote:
> If we do punt the issue to an upper layer, such as MPI, then should we
> export the fi_datatype_size() method so that apps can find out what
> libfabric was compiled with?
>
> On 7/8/15, 6:10 AM, "ofiwg-bounces at lists.openfabrics.org on behalf of
> Underwood, Keith D" <keith.d.underwood at intel.com> wrote:
>
>>Technically, MPI implementations are supposed to detect the heterogeneity
>>and deal with it (e.g. for the endianness problem).
>>
>>For LONG_DOUBLE, MPI essentially makes the assumption that the app and
>>MPI are compiled with "compatible" compilers and options regarding
>>LONG_DOUBLE formats.  Then again, the same is kind of true for the
>>definition of things like MPI_INT and MPI_LONG_INT, right?  Have we all
>>forgotten the pain of the transition to 64 bit?
>>
>>When we tackled this problem for Portals, the ints were easy:  we assumed
>>endianness matched (we didn't care about heterogeneous systems).  We
>>assumed that MPI/SHMEM/whatever would have to figure out what "int" meant
>>and could cast to either int32_t or int64_t, as appropriate for the
>>compiler.  Floats and doubles were obvious (only one size of those).
>>LONG_DOUBLE was... intractable... so we punted and said that the Portals
>>library and MPI library would have to be compiled with compatible
>>compilers and compiler options if anybody used that.  It wasn't a great
>>solution, but the list Sean has below was unpleasant, and LONG_DOUBLE use
>>is, um, rare.
>>
>>Keith
>>
>>> -----Original Message-----
>>> From: ofiwg-bounces at lists.openfabrics.org [mailto:ofiwg-bounces at lists.openfabrics.org] On Behalf Of Hefty, Sean
>>> Sent: Tuesday, July 07, 2015 5:46 PM
>>> To: Jason Gunthorpe
>>> Cc: ofiwg at lists.openfabrics.org
>>> Subject: Re: [ofiwg] the problems with atomics
>>>
>>> > > FI_LONG_DOUBLE_80BITS_ALIGNED_96,
>>> > > FI_LONG_DOUBLE_80BITS_ALIGNED_128
>>> > > FI_LONG_DOUBLE_96BITS_ALIGNED_96,
>>> > > FI_LONG_DOUBLE_96BITS_ALIGNED_128,
>>> > > FI_LONG_DOUBLE_128BITS_ALIGNED_128,
>>> > > FI_LONG_DOUBLE_WEIRD_PPC_ALIGNED_128
>>> >
>>> > You expect app writers to provide this value? That sounds very hard.
>>>
>>> Long double seems ridiculous.  But from what I can tell, 2 apps on the
>>> same system could in theory have different alignments based on how they
>>> were compiled.  I don't want what I listed above as a solution at all,
>>> but I do need something.  At the moment, what I need to fix is the lack
>>> of any specific definition for 'FI_LONG_DOUBLE', which was added under
>>> the assumption that it had a sane IEEE definition...
>>>
>>> Maybe the best option is to just define the equivalent of:
>>>
>>> FLOAT32/64/96/128
>>>
>>> And map DOUBLE -> FLOAT64, and LONG_DOUBLE -> FLOAT128
>>>
>>> Figuring out if FLOAT96 is implemented using 80 bits would be left as an
>>> exercise to the app.
>>>
>>> > There is no implicit conversion when doing a RDMA READ/WRITE or SEND,
>>> > so the app is going to be exposed to all of the differences if it
>>> > wants to run homogeneously.
>>>
>>> RDMA and sends only deal with bytes.  They don't attempt to interpret
>>> data at the remote side.
>>>
>>> > I'd probably also say that the integer atomic types should have the
>>> > same endianness as the local machine (why should ATOMIC X vs RDMA
>>> > READ return different answers?)
>>>
>>> Even with RDMA, an app should care about endianness.  E.g. rsockets
>>> exchanges endianness format as part of its protocol, so that RDMA writes
>>> end up in a format that the remote side can read.  I'm guessing most
>>> apps ignore this.
>>>
>>> > Generally speaking, changing the alignment and size breaks all memory
>>> > layouts, I'd be amazed if a MPI or PGAS app could run across hardware
>>> > with different FPU and different integer representations.
>>>
>>> I agree, but that doesn't mean that I think the libfabric API should be
>>> broken.  Apps should be able to run between different CPU architectures,
>>> though it will come with a cost.
>>>
>>> - Sean
>>> _______________________________________________
>>> ofiwg mailing list
>>> ofiwg at lists.openfabrics.org
>>> http://lists.openfabrics.org/mailman/listinfo/ofiwg

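Sean's point in the quoted text about exchanging the endianness format as
part of a protocol can be sketched as below. This is only an illustration
under stated assumptions: it is not the rsockets wire format, and the names
BYTE_ORDER_MARKER, write_marker, peer_needs_swap and swap64 are hypothetical.
Each side sends a known 32-bit constant in its native byte order, and the
receiver works out from the bytes it sees whether later 64-bit values need
swapping.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical marker: any constant with four distinct bytes works. */
#define BYTE_ORDER_MARKER 0x01020304u

/* Sender side: store the marker in native byte order in the handshake. */
static void write_marker(unsigned char out[4])
{
    uint32_t m = BYTE_ORDER_MARKER;
    memcpy(out, &m, sizeof m);
}

/* Receiver side: 0 if the peer's byte order matches ours, 1 if values
 * from the peer must be byte-swapped, -1 if the marker is unrecognized. */
static int peer_needs_swap(const unsigned char in[4])
{
    uint32_t m;
    memcpy(&m, in, sizeof m);
    if (m == BYTE_ORDER_MARKER)
        return 0;
    if (m == 0x04030201u)
        return 1;
    return -1;
}

/* Reverse the byte order of a 64-bit value. */
static uint64_t swap64(uint64_t v)
{
    v = ((v & 0x00ff00ff00ff00ffULL) << 8) | ((v >> 8) & 0x00ff00ff00ff00ffULL);
    v = ((v & 0x0000ffff0000ffffULL) << 16) | ((v >> 16) & 0x0000ffff0000ffffULL);
    return (v << 32) | (v >> 32);
}
```

This only covers integers. It says nothing about long double, where the
formats differ in more than byte order.
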


-- 
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/


