[openib-general] A critique of RDMA PUT/GET in HPC

Michael Krause krause at cup.hp.com
Tue Aug 29 07:53:55 PDT 2006


At 08:56 AM 8/25/2006, Greg Lindahl wrote:
>On Fri, Aug 25, 2006 at 10:13:01AM -0500, Tom Tucker wrote:
>
> > He does say this, but his analysis does not support this conclusion. His
> > analysis revolves around MPI send/recv, not the MPI 2.0 get/put
> > services.
>
>Nobody uses MPI put/get anyway, so leaving out analyzing that doesn't
>change reality much.

Is this due to legacy or other reasons?  One reason cited in the Winsock 
Direct work for using bcopy rather than RDMA zcopy operations was the cost of 
registering memory on a per-operation basis, i.e. for a single use.  The 
bcopy threshold was ~9KB.  With the new verbs developed for iWARP and then 
added to IB v1.2, the bcopy threshold was reduced to ~1KB.

Now, if I recall correctly, many MPI implementations split their buffer 
usage between what are often 1KB envelopes and larger regions.  One can 
persistently register the envelopes, so their size does not really matter, 
and thus could use send / receive or RDMA semantics for their update 
depending upon how the completions are managed.  The larger transfers can 
use RDMA semantics if desired.


> > A valid conclusion IMO is that "MPI send/recv can
> > be most efficiently implemented over an unconnected reliable datagram
> > protocol that supports 64bit tag matching at the data sink." And not
> > coincidentally, Myricom has this ;-)
>
>As do all of the non-VIA-family interconnects he mentions.  Since "we"
>all landed on the same conclusion, you might think we're on to
>something. Or not.

We've had this argument multiple times and examined all of the known and 
relatively high-volume usage models, which include the suite of MPI 
benchmarks used to evaluate and drive implementations.  Any interconnect 
architecture is one of compromise if it is to be used in a volume 
environment; the goal for the architects is to ensure the compromises do 
not result in a brain-dead or too diminished technology that will not meet 
customer requirements.

With respect to reliable datagram: unless one does software multiplexing 
over what amounts to a reliable connection, which comes with a performance 
penalty as well as complexity in error recovery and related logic, it 
really does not buy one anything better than the RC model used today.  
Given the application mix and the customer usage model, IB provided four 
transport types to meet different application needs and allow people to 
make choices.  iWARP reduced this to one, since the target applications 
really were met with RC, and reliable datagram as defined in IB simply was 
not being picked up or demanded by the targeted ISVs.  While some of us had 
argued for the software multiplex model, others wanted everything to be 
implemented in hardware, so IB is what it is today.  In any case, it is one 
of a set of reasonable compromises, and for the most part I contend it is 
difficult to argue that these interconnect technologies are so compromised 
that they are brain-dead or broken.

>However, that's only part of the argument.  Another part is that the
>buffer space needed to use RDMA put/get for all data links is huge.
>And there are some other interesting points.

The buffer and context differences needed to track RDMA vs. Send are not 
significant in terms of hardware.  In terms of software, memory needs to be 
registered in some capacity before DMA can target it, and hence there is a 
cost from the OS / application perspective.  Our goals were to be able to 
use application buffers to provide zero-copy data movement as well as OS 
bypass.  RDMA vs. Send does not incrementally differ in resource costs in 
the end.


> > I DO agree that it is interesting reading. :-), it's definitely got
> > people fired up.
>
>Heh. Glad you found it interesting.

The article is somewhat interesting but does not really present anything 
novel in this on-going debate on how interconnects should be 
designed.  There will always be someone pointing out a particular issue 
here and there, and in the end many of these amount to mouse nuts when 
placed in the larger context.  When they don't, a new interconnect is 
defined or extensions are made to compensate, as nothing is ever permanent 
or perfect.

Mike 