[ofa-general] New proposal for memory management
Barrett, Brian W
bwbarre at sandia.gov
Thu Apr 30 10:24:18 PDT 2009
On 4/30/09 11:09 , "Woodruff, Robert J" <robert.j.woodruff at intel.com> wrote:
> Jeff wrote,
>> I would be extremely hesitant to have an OpenFabrics-provided library
>> do this. MPI implementations spend a *lot* of time and effort on this
>> section of code because it is *the* heart of the MPI message passing
>> engine. To be blunt: there is not enough MPI expertise in the current
>> set of OpenFabrics developers to build such a library. I doubt that
>> the academic and proprietary MPI implementations would want to
>> contribute resources to make one, either (it's their secret sauce!).
> Interesting that you would want the OFA developers to implement a
> memory registration cache and think they could manage the registration
> of MPI memory better than MPI can, but then say that tag-matching drivers
> in MPI are their secret sauce. It seems registration caching is also
> part of various MPIs' secret sauce.
I somewhat disagree with Jeff - I'd love to see OFA implement tag-matching,
as we MPI implementors can (optionally) use it to save development time and
then pound on the hardware guys until they actually implement proper tag
matching and offload in hardware. But my goals are driven by a slightly
different market than the rest of the planet (i.e., huge machines that
actually work when running a single 10k-20k process job).
The registration caching isn't really secret sauce. It's more like the
residue that forms around the cap to the secret sauce bottle. We have to do
it to get good performance, it doesn't work reliably, and we can't fix it.
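To make the reliability complaint concrete, here is a minimal sketch of how a registration cache can go stale. It is an illustration only, not code from any MPI implementation: `FakeHCA`, `RegistrationCache`, and the `pages` labels are hypothetical stand-ins, and real registration goes through the verbs library rather than a Python dictionary. The cache is keyed on the virtual address range; if the memory goes back to the OS and the same address is later handed out again with different physical pages, the cached entry still points at the old pages unless someone tells the cache to drop it.

```python
class FakeHCA:
    """Hypothetical stand-in for the adapter; counts registrations."""

    def __init__(self):
        self.registrations = 0

    def register(self, addr, length, pages):
        # Real registration pins the physical pages currently behind
        # [addr, addr + length). `pages` models which pages those are.
        self.registrations += 1
        return {"key": self.registrations, "pinned_pages": pages}


class RegistrationCache:
    """Hypothetical cache keyed on the virtual address range."""

    def __init__(self, hca):
        self.hca = hca
        self.entries = {}  # (addr, length) -> registration record

    def lookup_or_register(self, addr, length, pages):
        entry = self.entries.get((addr, length))
        if entry is None:
            entry = self.hca.register(addr, length, pages)
            self.entries[(addr, length)] = entry
        return entry

    def invalidate(self, addr, length):
        # Only helps if something reliably calls it when memory is released.
        self.entries.pop((addr, length), None)


def demo(cache_is_notified):
    hca = FakeHCA()
    cache = RegistrationCache(hca)
    # First long send from a buffer at 0x1000, backed by physical pages "A".
    cache.lookup_or_register(0x1000, 4096, pages="A")
    # The buffer is freed and the memory is returned to the OS.
    if cache_is_notified:
        cache.invalidate(0x1000, 4096)
    # A later allocation reuses the same address, now backed by pages "B".
    entry = cache.lookup_or_register(0x1000, 4096, pages="B")
    stale = entry["pinned_pages"] != "B"
    return hca.registrations, stale
```

In this sketch, `demo(False)` registers once and ends up with a stale entry, while `demo(True)` registers twice and stays correct. The whole difficulty is that the `cache_is_notified` branch is the part that cannot be guaranteed in practice.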
I have all the information I need to do tag matching properly on the main
processor. I don't have all the information I need to write a registration
cache. I can't reliably know when memory is going back to the OS (because
there still isn't a 100% foolproof way of intercepting when memory is given
back to the OS). The registration cache is also used for long messages, where a
couple hundred nanoseconds of added latency aren't critical, as opposed to tag
matching, which is in the critical path of short messages, where a couple of
extra tens of nanoseconds is a deal breaker.
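For readers less familiar with the term, tag matching on the main processor can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not any MPI implementation's actual engine: the `Matcher` class and its method names are hypothetical, communicators are left out, and `ANY` plays the role of MPI_ANY_SOURCE / MPI_ANY_TAG. Two ordered queues are kept: one of posted receives and one of "unexpected" messages that arrived before a matching receive was posted.

```python
from collections import deque

ANY = -1  # wildcard, in the spirit of MPI_ANY_SOURCE / MPI_ANY_TAG


class Matcher:
    """Hypothetical software tag matcher with two ordered queues."""

    def __init__(self):
        self.posted = deque()      # (src, tag) receives awaiting a message
        self.unexpected = deque()  # (src, tag, payload) awaiting a receive

    @staticmethod
    def _matches(want_src, want_tag, src, tag):
        return want_src in (ANY, src) and want_tag in (ANY, tag)

    def post_recv(self, src, tag):
        # Check already-arrived messages first, oldest first.
        for i, (msrc, mtag, payload) in enumerate(self.unexpected):
            if self._matches(src, tag, msrc, mtag):
                del self.unexpected[i]
                return payload
        # Nothing has arrived yet: remember the receive.
        self.posted.append((src, tag))
        return None

    def incoming(self, src, tag, payload):
        # Walk posted receives in posting order; the first match wins.
        for i, (wsrc, wtag) in enumerate(self.posted):
            if self._matches(wsrc, wtag, src, tag):
                del self.posted[i]
                return (wsrc, wtag)
        # No receive posted yet: buffer the message as unexpected.
        self.unexpected.append((src, tag, payload))
        return None
```

Both paths are linear walks over a queue, and every short message goes through one of them, which is why even small per-message overheads in this code show up directly in short-message latency. The unexpected queue also has to store payloads somewhere, so its space is a real cost.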
>> Indeed, to make such a proposal work, there would, by definition, have
>> to be new hardware capabilities, and therefore new verbs to support
>> those hardware capabilities. So this might just end up as new verbs
>> anyway -- not a new middleware library.
> Yes, new hardware capabilities would be needed for this and it is always
> hard to get new hardware features added, but if they were added to some
> future IBTA or iWARP spec, I think it would be good for MPIs, as we have
> seen that this is the way other interconnects like Myrinet can achieve good
> performance for MPI applications.
> Anyway, just thought I would bring it up as a possibility for solving
> some of the issues that you raised at Sonoma.
I don't think it actually solves any of the problems. Assuming it's like
other verbs, you still have to deal with memory registration caches.
There's still registered memory somewhere, so fork() is still going to be
problematic. While it might not require an RC QP, it's going to require
some kind of QP, so CM setup is still a problem. API portability will still
be a problem. It might solve the reliable connectionless problem, since I
could envision a new QP type to support the tag matching. And there's still
the problem of how unexpected receives are handled and how much space it
In short, while I'd love to see tag matching, I'd rather make sure all the
other issues get solved properly first. Otherwise, we've just added another
interface that drives me up the wall.
Brian W. Barrett
Dept. 1423: Scalable System Software
Sandia National Laboratories