[ofa-general] New proposal for memory management
Jeff Squyres
jsquyres at cisco.com
Thu Apr 30 07:39:19 PDT 2009
On Apr 29, 2009, at 4:45 PM, Barrett, Brian W wrote:
> If you think this sounds like a hassle, think about what it looks like
> from the point of view of the MPI implementer (or any other developer
> writing libraries which sit between user data and OFED, like GASNet).
>
If you don't care about what pain MPI implementors have to go through
(and you probably don't ;-) ) -- consider that this is a major
roadblock to most *anyone* who wants to write to user verbs.
<banging the same old drum>
I heard lots of variations of "Why isn't OFED more popular?" in Sonoma
this year. This is at least one big reason why: no normal
(non-superhuman) programmer can write verbs code (IMHO). MPIs *have* to
support OpenFabrics -- HPC customers demand it. But non-HPC customers
have a clear alternative: they'll just write sockets code. And the
price/performance for using sockets over IB/iWARP may or may not be
attractive depending on the customer's buying capacity. Hence -- they
just buy gigE (10gigE, when the price drops low enough).
Doesn't OpenFabrics want to grow beyond MPI? Woody said that verbs is
designed to support a billion different things -- outside of MPI and a
few storage protocols (none of which are widely adopted), how much is
OFED used?
</banging the same old drum>
> Jeff and I talked for a while today, and we're pretty sure that as long
> as the byte set by the kernel notifier is written before the pages are
> returned into the unallocated list, there isn't actually a race
> condition.
> [snip]
>
> However, there's still then the problem with the notifier concept of how
> the kernel passes which pages were given back to the kernel. It has to
> pass a (potentially very large) amount of data back to the user, so the
> memory ownership issues with kernel/user space are interesting. It also
> has to somewhat atomically prepare the list and unset the notifier byte,
> which is also problematic. But probably workable.
>
I feel compelled to amend this: this notifier concept *may be
workable*, but it's still quite complex for the reasons Brian cited.
The goal here is to *reduce* complexity, especially for applications/
ULPs using the verbs stack.
If we put the registration cache in the network stack, application/ULP
complexity will be reduced significantly. My $0.02 is that using a
notifier solution is still fairly complex and introduces a new set of
problems.
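To make the notifier idea concrete, here is a minimal userspace sketch of a registration cache that checks a notifier byte before trusting a cached entry. Everything in it is an assumption for illustration: `notifier_byte`, `reg_cache`, `fake_register` and `lookup_or_register` are hypothetical names, the byte is an ordinary variable rather than something the kernel writes, and `fake_register` only stands in for an expensive call such as `ibv_reg_mr`. It also sidesteps the hard part Brian describes (passing back the list of released pages) by flushing the whole cache whenever the byte is set.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SLOTS 8

/* One cached registration (hypothetical layout). */
struct reg_entry {
    void  *addr;
    size_t len;
    int    valid;
};

/* Assumption: the kernel would set this byte when it takes pages back.
 * Here it is just a variable that the caller flips by hand. */
volatile unsigned char notifier_byte = 0;

struct reg_entry reg_cache[CACHE_SLOTS];
int registrations_done = 0;   /* counts the "expensive" registrations */

/* Stand-in for the real, expensive registration call. */
static void fake_register(struct reg_entry *e, void *addr, size_t len)
{
    e->addr = addr;
    e->len = len;
    e->valid = 1;
    registrations_done++;
}

/* Conservative invalidation: drop every cached entry. */
static void flush_cache(void)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        reg_cache[i].valid = 0;
}

/* Return a cached registration for (addr, len), registering on a miss. */
struct reg_entry *lookup_or_register(void *addr, size_t len)
{
    struct reg_entry *free_slot = NULL;

    assert(addr != NULL && len > 0);

    /* Clear the byte before flushing, so a notification that arrives
     * during the flush is seen on the next call instead of being lost. */
    if (notifier_byte) {
        notifier_byte = 0;
        flush_cache();
    }

    for (int i = 0; i < CACHE_SLOTS; i++) {
        struct reg_entry *e = &reg_cache[i];
        if (e->valid && e->addr == addr && e->len == len)
            return e;                       /* cache hit */
        if (!e->valid && free_slot == NULL)
            free_slot = e;                  /* remember first empty slot */
    }

    if (free_slot == NULL) {                /* cache full: start over */
        flush_cache();
        free_slot = &reg_cache[0];
    }
    fake_register(free_slot, addr, len);
    return free_slot;
}
```

Even this toy version has to get the ordering of "clear the byte" and "flush" right, and a threaded version would need an atomic exchange on the byte. That is the kind of complexity the notifier approach leaves with the application/ULP.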
FWIW: Putting the registration cache in the userspace verbs stack
means that verbs will now have to do the horrid malloc/mmap/etc.
intercept tricks that MPI implementations currently do. Take it from
us -- this is not a business you want to be in. Such intercepts
break tools like valgrind and other memory-checking debuggers. Even
the best intercept hooks available today can still be subverted. Open
MPI (and MX!) has to insert a pre-main hook to set up these intercepts,
and then check later to ensure that no one else subverted our hooks.
Yuck.
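For readers who have not seen this pattern, the following is a rough sketch of its shape, not Open MPI's or MX's actual code. The hook slot `release_hook` and the functions `cache_invalidate_hook`, `install_hook`, `hook_still_ours` and `release_memory` are all hypothetical; a real implementation intercepts the allocator and mmap/munmap paths rather than a private function pointer. The pre-main step uses the GCC/Clang `constructor` attribute, which is a compiler extension.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hook slot, standing in for whatever interception point
 * the allocator or dynamic linker actually offers. */
typedef void (*release_hook_fn)(void *addr, size_t len);
release_hook_fn release_hook = NULL;

size_t invalidations = 0;   /* how many times our hook fired */

/* Our hook: a real one would drop cached registrations that overlap
 * [addr, addr + len). Here it only counts calls. */
void cache_invalidate_hook(void *addr, size_t len)
{
    (void)addr;
    (void)len;
    invalidations++;
}

/* Pre-main hook: runs before main() under GCC/Clang. */
__attribute__((constructor))
static void install_hook(void)
{
    release_hook = cache_invalidate_hook;
}

/* Later check: has someone else replaced our hook since startup? */
int hook_still_ours(void)
{
    return release_hook == cache_invalidate_hook;
}

/* Stand-in for the allocator handing memory back to the OS. */
void release_memory(void *addr, size_t len)
{
    assert(len > 0);
    if (release_hook != NULL)
        release_hook(addr, len);
}
```

If any other library overwrites `release_hook` after startup, `hook_still_ours()` returns 0 and the cache can no longer be trusted, which is why the later check is needed at all.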
It's memory management. And that belongs in the kernel.
--
Jeff Squyres
Cisco Systems