[ofa-general] Re: New proposal for memory management

Jason Gunthorpe jgunthorpe at obsidianresearch.com
Wed Apr 29 13:25:30 PDT 2009


On Wed, Apr 29, 2009 at 08:15:57AM -0400, Jeff Squyres wrote:
> On Apr 29, 2009, at 12:03 AM, Jason Gunthorpe wrote:
>
>> I've often wondered, wouldn't it just be fine for MPI if the entire
>> process address space is kept pinned, registered and consistent with
>> the HCA? The process would opt in to this behavior during MPI
>> startup. Similar in spirit to the all physical memory registration the
>> kernel can do.

> Re-reading your brief text; I'm wondering if I missed the zen of what 
> you're trying to suggest...?  If I'm off the mark, can you explain more?  
> Thanks.

Ah yes, you went down the wrong path. I'm not suggesting doing anything
with physical memory. The idea is basically the equivalent of adding the
result of every mmap() and sbrk()/brk() call to the HCA mapping, and
removing it from the mapping at every call to munmap(), synchronously
with those syscalls.

The net result would be that the verbs registration would follow the
virtual memory allocation of the kernel. 

Basically, the API would work like this:
  struct ibv_mr *mr = /* some MR */;
  ibv_register_mr_all(mr);   // proposed call, does not exist today

  // At this point mr covers everything in /proc/self/maps

  void *foo = mmap(...);

  // Before mmap() returns, the equivalent of ibv_reg_mr(mr, foo, ...)
  // has been done

  munmap(foo, ...);
  // The equivalent of deregistering foo's range from mr has been done


Essentially, when this mode is enabled, mr always contains every
virtual address listed in /proc/self/maps.
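
To make "every virtual address in /proc/self/maps" concrete, here is a
rough user-space sketch that walks the file and hands each range to a
callback. The names for_each_mapping, range_fn and sum_bytes are made up
for illustration. The callback is only a stand-in for whatever would add
the range to the MR, and some special kernel mappings would probably
have to be skipped in practice.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical callback: invoked once per mapped virtual range.
 * A real implementation would add [start, end) to the MR here. */
typedef int (*range_fn)(uintptr_t start, uintptr_t end, void *ctx);

/* Walk /proc/self/maps and call fn for every mapping.
 * Returns the number of ranges visited, or -1 on error. */
static int for_each_mapping(range_fn fn, void *ctx)
{
    FILE *f;
    char line[512];
    int count = 0;
    int at_line_start = 1;

    assert(fn != NULL);

    f = fopen("/proc/self/maps", "r");
    if (!f)
        return -1;

    while (fgets(line, sizeof line, f)) {
        uintptr_t start, end;
        size_t n = strlen(line);

        /* Long path names can split one entry across several fgets()
         * chunks; only the first chunk carries the address range. */
        if (at_line_start &&
            sscanf(line, "%" SCNxPTR "-%" SCNxPTR, &start, &end) == 2) {
            if (fn(start, end, ctx) != 0) {
                fclose(f);
                return -1;
            }
            count++;
        }
        at_line_start = (n > 0 && line[n - 1] == '\n');
    }

    fclose(f);
    return count;
}

/* Stand-in callback: just totals the size of all mappings. */
static int sum_bytes(uintptr_t start, uintptr_t end, void *ctx)
{
    *(uint64_t *)ctx += (uint64_t)(end - start);
    return 0;
}
```

Calling for_each_mapping(sum_bytes, &total) with a zeroed uint64_t total
gives the amount of virtual memory that would have to stay registered.
A snapshot like this goes stale at the next mmap(), which is why the
proposal ties the update to the syscalls themselves.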

This mode is similar in effect to calling mlockall().
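
For comparison, this is roughly what the mlockall() opt-in looks like
today. The wrapper name pin_whole_address_space is made up for this
sketch. Note that mlockall() only keeps pages resident; it does not
register anything with the HCA, so it is an analogy for the opt-in
behavior rather than a replacement for it.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Lock all current and future mappings of this process into RAM.
 * Returns 0 on success, -1 on failure with errno preserved.
 * Failure is common when RLIMIT_MEMLOCK is small. */
static int pin_whole_address_space(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        int saved = errno;

        fprintf(stderr, "mlockall: %s\n", strerror(saved));
        errno = saved;
        return -1;
    }
    return 0;
}
```

With MCL_FUTURE set, later mmap() and brk() calls can fail if they would
push the process over its locked-memory limit, which is the same kind of
trade-off the proposed mode would have.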

The downside is that every byte of virtual memory in an MPI process
must be pinned to physical RAM before mmap() returns. You don't get
to swap MPI jobs. (Well, perhaps there could be a new mmap flag to
create unregistered memory that can be swapped, for special needs.)

Since this is done at mmap/brk time and not at page fault time, it
should not alter the performance of the MPI job unless it is doing
a lot of mmap calls for some reason (which is slow anyhow).
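
A user-space approximation of the "pay at mmap time" behavior might look
like the sketch below. Everything here (region_tracker, tracked_mmap,
tracked_munmap, demo_add, demo_remove) is hypothetical, and the hooks
only stand in for the HCA mapping updates. The actual proposal would do
this inside the kernel so that no mapping can be missed. MAP_POPULATE
is used to fault the pages in before the call returns.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical hook standing in for an HCA mapping update. */
typedef int (*region_hook)(void *addr, size_t len);

struct region_tracker {
    region_hook on_map;    /* would add [addr, addr+len) to the MR */
    region_hook on_unmap;  /* would remove it from the MR */
};

/* Map anonymous memory, fault it in, and run the "register" hook
 * before returning. Returns NULL on any failure. */
static void *tracked_mmap(const struct region_tracker *t, size_t len)
{
    void *p;

    assert(t != NULL);
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    if (t->on_map && t->on_map(p, len) != 0) {
        munmap(p, len);
        return NULL;
    }
    return p;
}

/* Run the "deregister" hook first, then unmap. */
static int tracked_munmap(const struct region_tracker *t, void *p, size_t len)
{
    assert(t != NULL);
    if (t->on_unmap && t->on_unmap(p, len) != 0)
        return -1;
    return munmap(p, len);
}

/* Demo hooks: only count how many bytes are currently "registered". */
static size_t tracked_bytes;

static int demo_add(void *addr, size_t len)
{
    (void)addr;
    tracked_bytes += len;
    return 0;
}

static int demo_remove(void *addr, size_t len)
{
    (void)addr;
    tracked_bytes -= len;
    return 0;
}
```

All of the extra work lands in the two wrappers, so code that touches
the memory afterwards takes no page faults and sees no registration
cost. Only a program that maps and unmaps constantly would notice.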

Jason


