[ofa-general] Re: New proposal for memory management

Ralph Campbell ralph.campbell at qlogic.com
Wed Apr 29 13:39:24 PDT 2009


On Wed, 2009-04-29 at 13:25 -0700, Jason Gunthorpe wrote:
> On Wed, Apr 29, 2009 at 08:15:57AM -0400, Jeff Squyres wrote:
> > On Apr 29, 2009, at 12:03 AM, Jason Gunthorpe wrote:
> >
> >> I've often wondered, wouldn't it just be fine for MPI if the entire
> >> process address space is kept pinned, registered and consistent with
> >> the HCA? The process would opt in to this behavior during MPI
> >> startup. Similar in spirit to the all physical memory registration the
> >> kernel can do.
> 
> > Re-reading your brief text; I'm wondering if I missed the zen of what 
> > you're trying to suggest...?  If I'm off the mark, can you explain more?  
> > Thanks.
> 
> Ah yes, you went down the wrong path. I don't suggest doing anything
> with physical memory, but basically the equivalent of adding the
> result of every mmap() and sbrk/brk() call to the HCA mapping, and
> removing from the mapping at every call to munmap(), synchronously
> with those syscalls.
> 
> The net result would be that the verbs registration would follow the
> virtual memory allocation of the kernel. 
> 
> Basically, the API would work like this:
>   ibv_mr *mr = // some MR..
>   ibv_register_mr_all(mr);
> 
>   // At this point mr has all of /proc/self/maps included
> 
>   void *foo = mmap(...);
> 
>   // Before mmap returns, the equivalent of ibv_reg_mr(mr,foo..) is
>   //  done
> 
>   munmap(foo...);
>   // ibv_unreg_mr(mr,foo) is done..
> 
> 
> Essentially when this mode is enabled, mr always contains every
> virtual address in /proc/self/maps.
> 
> It is similar to the effect you get by calling mlockall().
> 
> The downside is that every byte of virtual memory in a MPI process
> must be pinned to physical ram before mmap() returns. You don't get
> to swap MPI jobs. (Well, perhaps there could be a new mmap flag to
> create un-registered memory that can be swapped for special needs)
> 
> Since this is done at mmap/brk time and not at page fault time it
> should not alter the performance of the MPI job unless it is doing
> a lot of mmap calls for some reason (which is slow anyhow).
> 
> Jason

OK. This is a bit more reasonable. Putting this into my own words,
the HCA's mapping would mirror the application's VM to physical
mapping. Since the HCAs currently require this mapping to be fixed
between register/unregister, it would not be practical to pin this
amount of memory. It would require the dynamic mapping I mentioned
in reply to Ted Kim.
