[ofa-general] New proposal for memory management

Jeff Squyres jsquyres at cisco.com
Thu Apr 30 07:39:19 PDT 2009


On Apr 29, 2009, at 4:45 PM, Barrett, Brian W wrote:

> If you think this sounds like a hassle, think about what it looks like from
> the point of view of the MPI implementer (or any other developer writing
> libraries which sit between user data and OFED, like GASNet).
>

If you don't care about what pain MPI implementors have to go through  
(and you probably don't ;-) ) -- consider that this is a major  
roadblock to most *anyone* who wants to write to user verbs.

<banging the same old drum>

I heard lots of variations of "Why isn't OFED more popular?" in Sonoma
this year.  This is at least one big reason why: no normal
(non-superhuman) programmer can write verbs code (IMHO).  MPIs *have* to
support OpenFabrics -- HPC customers demand it.  But non-HPC customers
have a clear alternative: they'll just write sockets code.  And the
price/performance of sockets over IB/iWARP may or may not be
attractive depending on the customer's buying capacity.  Hence -- they
just buy gigE (10gigE, when the price drops low enough).

Doesn't OpenFabrics want to grow beyond MPI?  Woody said that verbs is  
designed to support a billion different things -- outside of MPI and a  
few storage protocols (none of which are widely adopted), how much is  
OFED used?

</banging the same old drum>

> Jeff and I talked for a while today, and we're pretty sure that as long as
> the byte set by the kernel notifier is written before the pages are returned
> into the unallocated list, there isn't actually a race condition.
> [snip]
>
> However, there's still then the problem with the notifier concept of how the
> kernel passes which pages were given back to the kernel.  It has to pass a
> (potentially very large) amount of data back to the user, so the memory
> ownership issues with kernel/user space are interesting.  It also has to
> somewhat atomically prepare the list and unset the notifier byte, which is
> also problematic.  But probably workable.
>


I feel compelled to amend this: this notifier concept *may be  
workable*, but it's still quite complex for the reasons Brian cited.   
The goal here is to *reduce* complexity, especially for applications/ 
ULPs using the verbs stack.
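
To make the notifier idea concrete, here is a minimal user-side sketch in
C.  It is an illustration under stated assumptions, not a proposed
interface: `notifier_byte`, `simulate_kernel_release()` and the tiny
fixed-size cache are all hypothetical names invented for this example, and
the "kernel" is simulated by an ordinary function.  The sketch also
sidesteps the hard part Brian mentions (passing back the list of released
pages) by conservatively flushing the whole cache whenever the byte is set.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: in a real design this byte would live in memory shared with
 * the kernel, which would set it before returning pages to the free list. */
static atomic_uchar notifier_byte;

#define CACHE_SLOTS 8

struct reg_entry {
    uintptr_t base; /* start address of the registered buffer */
    size_t len;     /* length in bytes; 0 marks an empty slot */
};

static struct reg_entry cache[CACHE_SLOTS];

/* Stand-in for the kernel side of the notifier (assumption for the demo). */
static void simulate_kernel_release(void)
{
    atomic_store_explicit(&notifier_byte, 1, memory_order_release);
}

/* Drop every cached registration. */
static void cache_flush(void)
{
    for (size_t i = 0; i < CACHE_SLOTS; i++)
        cache[i].len = 0;
}

/* Remember a registration; returns 0 on success, -1 if the cache is full. */
static int cache_insert(const void *addr, size_t len)
{
    assert(len > 0);
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].len == 0) {
            cache[i].base = (uintptr_t)addr;
            cache[i].len = len;
            return 0;
        }
    }
    return -1;
}

/* Returns 1 if [addr, addr + len) lies inside a cached registration. */
static int cache_lookup(const void *addr, size_t len)
{
    /* Read and clear the byte in one atomic step, so a release that lands
     * after this point is still seen by the next lookup. */
    if (atomic_exchange_explicit(&notifier_byte, 0, memory_order_acq_rel))
        cache_flush();

    uintptr_t p = (uintptr_t)addr;
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].len != 0 && len <= cache[i].len &&
            p >= cache[i].base && p - cache[i].base <= cache[i].len - len)
            return 1;
    }
    return 0;
}
```

Even this toy version has to decide what to do without a page list, and
every lookup pays for the atomic check.  That is the kind of complexity a
notifier leaves on the application/ULP side.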

If we put the registration cache in the network stack, application/ULP  
complexity will be reduced significantly.  My $0.02 is that using a  
notifier solution is still fairly complex and introduces a new set of  
problems.

FWIW: Putting the registration cache in the userspace verbs stack
means that verbs will now have to do the horrid malloc/mmap/etc.
intercept tricks that MPI implementations currently do.  Take it from
us -- this is not a business you want to be in.  Such intercepts
break tools like valgrind and other memory-checking debuggers.  Even
the best intercept hooks available today can still be subverted.  Open
MPI (and MX!) have to insert a pre-main hook to set up these intercepts,
and then check later to ensure that no one else subverted our hooks.
Yuck.
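
For readers who have not seen an intercept, here is a rough sketch in C of
one simple style: defining our own `munmap` so that it is called instead of
the C library's, invalidating a cached registration, and then issuing the
real system call.  This is not how Open MPI or MX actually implement their
hooks; the one-entry cache and the names `cache_remember()`,
`cache_holds()` and `invalidations` are hypothetical, and the code assumes
Linux.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical one-entry registration cache covering [base, base + len). */
static uintptr_t cached_base;
static size_t cached_len;  /* 0 means nothing is cached */
static int invalidations;  /* how many times the hook dropped the entry */

/* Record that a buffer has been "registered" (no real verbs call here). */
static void cache_remember(void *addr, size_t len)
{
    assert(len > 0);
    cached_base = (uintptr_t)addr;
    cached_len = len;
}

/* Returns 1 if addr falls inside the cached registration. */
static int cache_holds(const void *addr)
{
    uintptr_t p = (uintptr_t)addr;
    return cached_len != 0 && p >= cached_base &&
           p - cached_base < cached_len;
}

/* Our definition takes precedence over the C library's for calls made
 * through this symbol.  Drop any overlapping cache entry first, then hand
 * the pages back with the raw system call. */
int munmap(void *addr, size_t len)
{
    uintptr_t lo = (uintptr_t)addr;
    if (cached_len != 0 && lo < cached_base + cached_len &&
        cached_base < lo + len) {
        cached_len = 0;
        invalidations++;
    }
    return (int)syscall(SYS_munmap, addr, len);
}
```

Note what this does not catch: any code path that releases memory without
going through this particular symbol never reaches the hook, and anyone
else who replaces `munmap` after us wins.  That is the "can still be
subverted" problem in miniature.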

It's memory management.  And that belongs in the kernel.

-- 
Jeff Squyres
Cisco Systems



