[openib-general] A critique of RDMA PUT/GET in HPC
Michael Krause
krause at cup.hp.com
Tue Aug 29 07:53:55 PDT 2006
At 08:56 AM 8/25/2006, Greg Lindahl wrote:
>On Fri, Aug 25, 2006 at 10:13:01AM -0500, Tom Tucker wrote:
>
> > He does say this, but his analysis does not support this conclusion. His
> > analysis revolves around MPI send/recv, not the MPI 2.0 get/put
> > services.
>
>Nobody uses MPI put/get anyway, so leaving out analyzing that doesn't
>change reality much.
Is this due to legacy or other reasons? One reason cited from Winsock
Direct for using bcopy rather than the RDMA zcopy operations was the cost of
registering memory when done on a per-operation basis, i.e. single use. The
bcopy threshold was ~9KB. With the new verbs developed for iWARP and then
added to IB v1.2, the bcopy threshold was reduced to ~1KB.
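To make the threshold idea concrete, here is a minimal sketch of the size-based choice between the copy path and the zero-copy path. The function name and constants are hypothetical; the two threshold values are simply the approximate figures quoted above, not measured numbers, and a real provider would tune them per platform.

```python
# Approximate thresholds from the figures quoted above (illustrative only).
OLD_BCOPY_THRESHOLD = 9 * 1024  # ~9KB, before the newer verbs
NEW_BCOPY_THRESHOLD = 1 * 1024  # ~1KB, with the iWARP / IB v1.2 verbs


def choose_path(message_bytes: int, bcopy_threshold: int) -> str:
    """Return 'bcopy' for messages below the threshold, else 'zcopy'.

    Below the threshold, copying through a pre-registered buffer is assumed
    to be cheaper than registering the application buffer for a single use.
    """
    if message_bytes < 0:
        raise ValueError("message size must be non-negative")
    return "bcopy" if message_bytes < bcopy_threshold else "zcopy"


if __name__ == "__main__":
    for size in (512, 4096, 16384):
        print(size,
              choose_path(size, OLD_BCOPY_THRESHOLD),
              choose_path(size, NEW_BCOPY_THRESHOLD))
```

Under these assumed thresholds, a 4KB message takes the copy path with the old value and the zero-copy path with the new one, which is the practical effect of lowering the threshold.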
Now, if I recall correctly, many MPI implementations split their buffer
usage between envelopes, which are often 1KB, and large data regions. One
can persistently register the envelopes, so their size does not really
matter, and they can be updated with either send / receive or RDMA semantics
depending upon how the completions are managed. The larger data movements
can use RDMA semantics if desired, as these are typically large in size.
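The envelope / large-region split can be sketched as a toy model that only counts memory registrations. Everything here is an assumption for illustration: the class name, the 1KB envelope size taken from the "often 1KB" figure above, and the policy of registering a large buffer once per transfer. It moves no data and calls no real verbs API.

```python
ENVELOPE_BYTES = 1024  # assumed envelope size, from the "often 1KB" figure


class Transport:
    """Toy model that counts registrations; it does not move any data."""

    def __init__(self, envelope_count: int):
        # Envelopes are registered once, up front, and reused afterwards,
        # so their registration cost is paid a fixed number of times.
        self.registrations = envelope_count

    def send(self, payload: bytes) -> str:
        if len(payload) <= ENVELOPE_BYTES:
            # Small message: goes through a persistently registered
            # envelope, so no new registration is needed.
            return "envelope"
        # Large transfer: assume the application buffer is registered
        # for this operation and moved with RDMA semantics.
        self.registrations += 1
        return "rdma"


if __name__ == "__main__":
    t = Transport(envelope_count=4)
    for n in (100, 1024, 64 * 1024):
        print(n, t.send(b"x" * n), t.registrations)
```

In this model any number of small sends leaves the registration count unchanged, while each large transfer adds one registration.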
> > A valid conclusion IMO is that "MPI send/recv can
> > be most efficiently implemented over an unconnected reliable datagram
> > protocol that supports 64bit tag matching at the data sink." And not
> > coincidentally, Myricom has this ;-)
>
>As do all of the non-VIA-family interconnects he mentions. Since "we"
>all landed on the same conclusion, you might think we're on to
>something. Or not.
We've had this argument multiple times and examined all of the known,
relatively high-volume usage models, including the suite of MPI benchmarks
used to evaluate and drive implementations. Any interconnect architecture
is one of compromise if it is to be used in a volume environment. The goal
for the architects is to ensure the compromises do not result in a
brain-dead or overly diminished technology that will not meet customer
requirements.
With respect to reliable datagram: unless one does software multiplexing
over what amounts to a reliable connection, which comes with a performance
penalty as well as complexity in terms of error recovery logic, etc., it
really does not buy one anything better than the RC model used today. Given
the application mix and the customer usage model, IB provided four
transport types to meet different application needs and allow people to
make choices. iWARP reduced this to one, since the needs of the target
applications were met with RC, and reliable datagram as defined in IB simply
was not being picked up or demanded by the targeted ISVs. While some of us
had argued for the software multiplex model, others wanted everything to be
implemented in hardware, so IB is what it is today. In any case, it is one
of a set of reasonable compromises, and for the most part I contend it is
difficult to argue that these interconnect technologies are so compromised
that they are brain-dead or broken.
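For readers unfamiliar with the software multiplex model, the following is a minimal sketch of the idea: several logical endpoints share one ordered, reliable byte stream, and a small header on each message lets the receiver demultiplex. The class name and the header layout (4-byte endpoint id, 2-byte length) are hypothetical, and a byte buffer stands in for the reliable connection. It ignores error recovery entirely, which is where much of the complexity mentioned above would live.

```python
import struct
from collections import defaultdict, deque

# Hypothetical frame header: endpoint id (uint32), payload length (uint16).
HEADER = struct.Struct("!IH")


class MuxedConnection:
    """Toy multiplexer: many logical endpoints over one ordered stream."""

    def __init__(self):
        self._wire = bytearray()          # stands in for the single connection
        self._inbox = defaultdict(deque)  # per-endpoint receive queues

    def send(self, endpoint: int, payload: bytes) -> None:
        # Every message pays for a header; this is part of the overhead.
        self._wire += HEADER.pack(endpoint, len(payload)) + payload

    def deliver(self) -> None:
        """Demultiplex everything on the wire into endpoint queues."""
        offset = 0
        while offset < len(self._wire):
            endpoint, length = HEADER.unpack_from(self._wire, offset)
            offset += HEADER.size
            self._inbox[endpoint].append(
                bytes(self._wire[offset:offset + length]))
            offset += length
        self._wire.clear()

    def recv(self, endpoint: int) -> bytes:
        # Raises IndexError if nothing is queued for this endpoint.
        return self._inbox[endpoint].popleft()


if __name__ == "__main__":
    conn = MuxedConnection()
    conn.send(1, b"to endpoint 1")
    conn.send(2, b"to endpoint 2")
    conn.deliver()
    print(conn.recv(2), conn.recv(1))
```

The per-message header parsing and queueing done in `deliver` is the kind of work that a hardware transport would otherwise absorb, which is one way to read the performance penalty described above.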
>However, that's only part of the argument. Another part is that the
>buffer space needed to use RDMA put/get for all data links is huge.
>And there are some other interesting points.
The buffer and context differences to track RDMA vs. Send are not
significant in terms of hardware. In terms of software, memory needs to be
registered in some capacity to perform DMA to it, and hence there is a cost
from the OS / application perspective. Our goals were to be able to use
application buffers to provide zero-copy data movement as well as OS
bypass. In the end, RDMA and Send do not differ meaningfully in terms of
resource costs.
> > I DO agree that it is interesting reading. :-), it's definitely got
> > people fired up.
>
>Heh. Glad you found it interesting.
The article is somewhat interesting but does not really present anything
novel in this ongoing debate on how interconnects should be
designed. There will always be someone pointing out a particular issue
here and there, and in the end many of these amount to mouse nuts when
placed into the larger context. When they don't, a new interconnect is
defined or extensions are made to compensate, as nothing is ever permanent
or perfect.
Mike