[ofa-general] New proposal for memory management

Barrett, Brian W bwbarre at sandia.gov
Wed Apr 29 13:45:23 PDT 2009


On 4/29/09 11:03 , "Roland Dreier" <rdreier at cisco.com> wrote:

>> But whacky situations might occur in a multithreaded application where
>> one thread calls free() while another thread calls malloc(), gets the
>> same virtual address that was just free()d but has not yet been
>> unregistered in the kernel, so a subsequent ibv_post_send() may
>> succeed but be sending the wrong data.
>> 
>> Put simply: in a multi-threaded application, there's always the chance
>> that the notify won't get to the user-level process until after the
>> global notifier variable has been checked, right?  Or, putting it the
>> other way: is there any kind of notify system that could be used that
>> *can't* create a potential race condition in a multi-threaded user
>> application?
> 
> Without thinking too much about the proposal (except that it adds a lot
> of new verb interfaces and a lot of kernel code, and therefore feels
> like a hassle to me), I don't see how this race is solved by moving a
> cache to the kernel.

If you think this sounds like a hassle, think about what it looks like from
the point of view of the MPI implementer (or any other developer writing
libraries which sit between user data and OFED, like GASNet).  We don't
write kernel modules, can't do much to change libc, and have to compete on
performance (particularly benchmarks that send large messages from the same
buffer).  We're forced into a library-level pin cache to get competitive
performance, but don't have the hooks to do it properly.  Instead, we try a
whole list of hacks to intercept free() and munmap() and hope for the best,
often missing.

And Open Fabrics is the only "commodity" interface that makes implementers
go through these pains.  Myrinet's MX, Cray's Portals, and Quadrics' Tports
all handle these issues at either the driver library or kernel module level.

One statistic I like to point out (as a supporter of proper offload
interconnects and interfaces) is that there are 13,363 lines of code to
support InfiniBand within Open MPI, and that doesn't include logic for pin
caching, message matching, request management, or multi-NIC striping.  There
are 4,560 lines of code to support Cray Portals, and that includes all logic
for pin caching, message matching, request management, and multi-NIC
striping.  Guess which one I think is more complex and feels like a hassle
to me?

> If you have free()/malloc() of a buffer running in parallel with send
> operations targeting the same buffer, then that seems like a buggy MPI
> application.  Since free()/malloc() might not involve the kernel at all
> (the userspace library might keep its own free list, etc) I don't see
> how a registration cache in the kernel would help anyway.
> 
> Now, since free()/malloc() operations must be serialized with respect to
> send/receive operations in userspace anyway, I don't see why a simpler
> (and possibly more flexible/powerful) kernel notifier design can't
> work -- if free() releases virtual memory back to the kernel, then the
> kernel notifier will run before the free() call returns, so things
> should work as planned.

Jeff and I talked for a while today, and we're pretty sure that as long as
the byte set by the kernel notifier is written before the pages are returned
to the free list, there isn't actually a race condition.  It does mean that
every time the pin cache is searched, we also have to check the byte (and
likely take a cache miss), but that's not too evil.
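
In code, the fast path we're imagining looks something like this (a sketch
with made-up names; notify_byte is wherever the kernel-set byte ends up
mapped, and cache_find()/cache_flush_invalid() stand in for the library's
cache internals):

#include <stddef.h>
#include <infiniband/verbs.h>

extern volatile unsigned char *notify_byte;  /* mapped from the kernel */
extern void cache_flush_invalid(void);       /* reconcile with the kernel */
extern struct ibv_mr *cache_find(void *addr, size_t len);

struct ibv_mr *cache_lookup(void *addr, size_t len)
{
    /* Check the notifier byte on every lookup; this load is the
     * likely cache miss mentioned above. */
    if (*notify_byte)
        cache_flush_invalid();

    return cache_find(addr, len);
}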

However, the notifier concept still has the problem of how the kernel
communicates which pages were given back to it.  It has to pass a
(potentially very large) amount of data back to the user, so the kernel/user
space memory ownership issues are interesting.  It also has to prepare the
list and unset the notifier byte more or less atomically, which is also
problematic.  But probably workable.
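
One possibility (purely illustrative, not part of the proposal) is a shared
ring of address ranges, which makes both problems concrete:

#include <stddef.h>

struct unmap_event {
    unsigned long addr;
    unsigned long len;
};

#define UNMAP_RING_SLOTS 256

struct unmap_ring {
    volatile unsigned long head;    /* advanced by the kernel */
    unsigned long          tail;    /* advanced by user space */
    volatile unsigned char notify;  /* the notifier byte itself */
    struct unmap_event     ev[UNMAP_RING_SLOTS];
};

extern void pin_cache_invalidate(void *addr, size_t len);

void drain_unmap_ring(struct unmap_ring *r)
{
    while (r->tail != r->head) {
        struct unmap_event *e = &r->ev[r->tail % UNMAP_RING_SLOTS];
        pin_cache_invalidate((void *)e->addr, e->len);
        r->tail++;
    }
    /* The atomicity problem: the kernel may queue another event and
     * re-set 'notify' between the drain loop and this store. */
    r->notify = 0;
}

A fixed-size ring also has to handle overflow (fall back to flushing the
whole cache, presumably), which is the "potentially very large" data problem
in another form.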

So perhaps the notifier method would be sufficient after all.

Brian

--
   Brian W. Barrett
   Dept. 1423: Scalable System Software
   Sandia National Laboratories



