[ofiwg] the problems with atomics

Underwood, Keith D keith.d.underwood at intel.com
Wed Jul 8 04:10:06 PDT 2015


Technically, MPI implementations are supposed to detect the heterogeneity and deal with it (e.g. for the endianness problem).

For LONG_DOUBLE, MPI essentially makes the assumption that the app and MPI are compiled with "compatible" compilers and options regarding LONG_DOUBLE formats.  Then again, the same is kind of true for the definition of things like MPI_INT and MPI_LONG_INT, right?  Have we all forgotten the pain of the transition to 64 bit?

When we tackled this problem for Portals, the ints were easy:  we assumed endianness matched (we didn't care about heterogeneous systems).  We assumed that MPI/SHMEM/whatever would have to figure out what "int" meant and could cast to either int32_t or int64_t, as appropriate for the compiler.  Floats and doubles were obvious (only one size of each).  LONG_DOUBLE was... intractable... so we punted and said that the Portals library and MPI library would have to be compiled with compatible compilers and compiler options if anybody used that.  It wasn't a great solution, but the list Sean has below was unpleasant, and LONG_DOUBLE use is, um, rare.
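
In code, the int mapping amounted to something like this (a sketch, not actual Portals or MPI source; the PTL_* names are placeholders for whatever datatype constants the transport defines):

    #include <limits.h>
    #include <stdint.h>

    /* Map the compiler's "int" onto a fixed-width atomic datatype at
     * compile time; the matching-endianness assumption covers the rest. */
    #if INT_MAX == INT32_MAX
    typedef int32_t atomic_int_equiv;
    #define ATOMIC_INT_DTYPE PTL_INT32_T
    #elif INT_MAX == INT64_MAX
    typedef int64_t atomic_int_equiv;
    #define ATOMIC_INT_DTYPE PTL_INT64_T
    #else
    #error "int is neither 32 nor 64 bits wide"
    #endif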

Keith

> -----Original Message-----
> From: ofiwg-bounces at lists.openfabrics.org [mailto:ofiwg-bounces at lists.openfabrics.org] On Behalf Of Hefty, Sean
> Sent: Tuesday, July 07, 2015 5:46 PM
> To: Jason Gunthorpe
> Cc: ofiwg at lists.openfabrics.org
> Subject: Re: [ofiwg] the problems with atomics
> 
> > > FI_LONG_DOUBLE_80BITS_ALIGNED_96,
> > > FI_LONG_DOUBLE_80BITS_ALIGNED_128
> > > FI_LONG_DOUBLE_96BITS_ALIGNED_96,
> > > FI_LONG_DOUBLE_96BITS_ALIGNED_128,
> > > FI_LONG_DOUBLE_128BITS_ALIGNED_128,
> > > FI_LONG_DOUBLE_WEIRD_PPC_ALIGNED_128
> >
> > You expect app writers to provide this value? That sounds very hard.
> 
> Long double seems ridiculous.  But from what I can tell, two apps on the
> same system could in theory have different long double alignments based
> on how they were compiled.  I don't want what I listed above as a
> solution at all, but I do need something.  At the moment, what I need to
> fix is the lack of any specific definition for 'FI_LONG_DOUBLE', which
> was added under the assumption that it had a sane IEEE definition...
> 
> Maybe the best option is to just define the equivalent of:
> 
> FLOAT32/64/96/128
> 
> And map DOUBLE -> FLOAT64, and LONG_DOUBLE -> FLOAT128
> 
> Figuring out if FLOAT96 is implemented using 80 bits would be left as an
> exercise for the app.
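> 
> An app could at least sniff the local layout from <float.h> (an
> illustrative check, not part of any proposed API):
> 
>     #include <float.h>
>     #include <stdio.h>
> 
>     int main(void)
>     {
>         /* The mantissa width identifies the format: 64 -> x87 80-bit,
>          * 113 -> IEEE binary128, 106 -> PPC double-double,
>          * 53 -> long double is just double. */
>         printf("size %zu, mantissa bits %d\n",
>                sizeof(long double), LDBL_MANT_DIG);
>         if (LDBL_MANT_DIG == 64)
>             printf("80-bit x87, padded to %zu bytes\n",
>                    sizeof(long double));
>         else if (LDBL_MANT_DIG == 113)
>             printf("IEEE binary128\n");
>         else if (LDBL_MANT_DIG == 106)
>             printf("PPC double-double\n");
>         return 0;
>     }
> 
> The sizeof() result already reflects the padding (12 vs. 16 bytes for
> the x87 format), which is what distinguishes the ALIGNED_96 /
> ALIGNED_128 variants above.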
> 
> > There is no implicit conversion when doing a RDMA READ/WRITE or SEND,
> > so the app is going to be exposed to all of the differences if it
> > wants to run homogeneously.
> 
> RDMA and sends only deal with bytes.  They don't attempt to interpret data
> at the remote side.
> 
> > I'd probably also say that the integer atomic types should have the
> > same endianness as the local machine (why should ATOMIC X vs RDMA
> > READ return different answers?)
> 
> Even with RDMA, an app should care about endianness.  E.g. rsockets
> exchanges each side's endianness as part of its protocol, so that RDMA
> writes end up in a format that the remote side can read.  I'm guessing
> most apps ignore this.
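> 
> Roughly the kind of conversion that implies, in miniature (a sketch
> assuming the glibc <endian.h> helpers; the negotiation itself is up to
> the protocol):
> 
>     #include <endian.h>   /* htobe64, htole64 (glibc) */
>     #include <stdint.h>
> 
>     /* Encode a value into the byte order negotiated with the peer at
>      * connection setup, before posting the RDMA write. */
>     static uint64_t wire_encode(uint64_t host_val, int peer_big_endian)
>     {
>         return peer_big_endian ? htobe64(host_val)
>                                : htole64(host_val);
>     }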
> 
> > Generally speaking, changing the alignment and size breaks all memory
> > layouts; I'd be amazed if an MPI or PGAS app could run across hardware
> > with different FPU and different integer representations.
> 
> I agree, but that doesn't mean that I think the libfabric API should be
> broken.  Apps should be able to run across different CPU architectures,
> though it will come with a cost.
> 
> - Sean
> _______________________________________________
> ofiwg mailing list
> ofiwg at lists.openfabrics.org
> http://lists.openfabrics.org/mailman/listinfo/ofiwg


