[ofa-general] Re: New proposal for memory management
Steven Truelove
truelove at array.ca
Thu Apr 30 05:49:30 PDT 2009
John A. Gregor wrote:
> So, how about this:
>
> Maintain a pool of pre-pinned pages.
>
> When an RTS comes in, use one of the pre-pinned buffers as the place the
> DATA will land. Set up the remaining hw context to enable receipt into
> the page(s) and fire back your CTS.
>
> While the CTS is in flight and the DATA is streaming back (and you
> therefore have a couple microseconds to play with), remap the virt-to-phys
> mapping of the application so that the original virtual address now
> points at the pre-pinned page.
A big part of the performance improvement associated with RDMA is
avoiding constant page remappings and data copies. If pinning the
physical/virtual memory mapping were cheap enough to do for each
message, MPI applications could simply pin and register the mapping
when sending/receiving each message and then unregister it when the
operation completed. MPI implementations maintain a cache of which
memory has been registered precisely because it is too expensive to
map/unmap/remap memory constantly.
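To illustrate what I mean by a registration cache, here is a minimal
sketch against the libibverbs API. The fixed-size table and get_mr()
helper are hypothetical simplifications -- real implementations use
interval trees and handle eviction and invalidation:

    /* Minimal registration cache sketch (assumes libibverbs).
     * get_mr() and the flat table are illustrative only. */
    #include <stddef.h>
    #include <infiniband/verbs.h>

    #define CACHE_SLOTS 64

    struct reg_entry {
        void          *addr;
        size_t         len;
        struct ibv_mr *mr;    /* pinned + registered region */
    };

    static struct reg_entry cache[CACHE_SLOTS];

    /* Return a registration covering [addr, addr+len), reusing a
     * cached one when possible to avoid paying the pin cost again. */
    struct ibv_mr *get_mr(struct ibv_pd *pd, void *addr, size_t len)
    {
        for (int i = 0; i < CACHE_SLOTS; i++) {
            struct reg_entry *e = &cache[i];
            if (e->mr && addr >= e->addr &&
                (char *)addr + len <= (char *)e->addr + e->len)
                return e->mr;          /* cache hit: no pinning */
        }
        /* Miss: pin and register, then remember the mapping. */
        struct ibv_mr *mr = ibv_reg_mr(pd, addr, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr)
            return NULL;
        for (int i = 0; i < CACHE_SLOTS; i++) {
            if (!cache[i].mr) {
                cache[i] = (struct reg_entry){ addr, len, mr };
                break;
            }
        }
        return mr;
    }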
Copying the parts of the page(s) not involved in the transfer would also
raise overhead quite a bit for smaller RDMAs. It is quite easy to see a
5 or 6K message requiring a 2-3K copy to fill in the rest of its pages.
And heaven help those systems with huge pages, ~1MB, in such a case.
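To make the arithmetic concrete, here is a rough sketch of the fix-up
cost under that remapping scheme. The page and message sizes are just
illustrative, and I assume the message starts page-aligned:

    #include <stdio.h>

    int main(void)
    {
        size_t page = 4096;       /* 4K pages; huge pages are far worse */
        size_t msg  = 5 * 1024;   /* a 5K message, page-aligned start */

        /* The remap swaps whole pages, so everything in the touched
         * pages that is NOT part of the message must be copied from
         * the old pages into the pre-pinned ones. */
        size_t pages  = (msg + page - 1) / page;   /* pages touched: 2 */
        size_t copied = pages * page - msg;        /* bytes to fix: 3K */

        printf("%zu-byte message -> copy %zu bytes of surrounding data\n",
               msg, copied);
        return 0;
    }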
I have seen this problem in our own MPI application. The 'simple'
solution I have seen used in at least one MPI implementation for this
problem is to prevent the malloc/free implementation in use from
ever returning memory to the OS. The virtual/physical mapping can only
become invalid if virtual addresses are given back to the OS and then
returned later backed by different physical pages. Under Linux, at
least, it is quite easy to tell libc never to return memory to the OS.
In that case free() and similar functions simply retain the memory for
use by future malloc() (and similar) calls. Because the memory is never
unpinned and never given back to the OS, the physical/virtual mapping
stays consistent forever. I don't know if other OSes make this as easy,
or even what systems most MPI implementors want their software to run on.
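With glibc this comes down to two mallopt() calls (there are also the
equivalent MALLOC_TRIM_THRESHOLD_ / MALLOC_MMAP_MAX_ environment
variables). A minimal sketch:

    #include <malloc.h>

    /* Tell glibc malloc to hold on to freed memory forever:
     *  - M_TRIM_THRESHOLD = -1 disables returning heap (sbrk)
     *    memory to the kernel on free();
     *  - M_MMAP_MAX = 0 disables mmap-backed allocations, which
     *    would otherwise be munmap()ed on free and come back
     *    later with different physical pages.
     * Call this once, before any memory gets registered. */
    void keep_heap_forever(void)
    {
        mallopt(M_TRIM_THRESHOLD, -1);
        mallopt(M_MMAP_MAX, 0);
    }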
The obvious downside to this is that a process with highly irregular
memory demand will always occupy the memory of its previous peak usage.
And because the memory is pinned, it cannot even be swapped out, and it
counts against the memory pinning ulimit. For many MPI applications
that is not a problem -- they often have quite fixed memory usage and
wouldn't be returning much, if any, memory to the OS anyway. This is
the case for our application. I imagine someone out there has some job
that doesn't behave so neatly, of course.
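(For what it's worth, the limit the kernel enforces here is
RLIMIT_MEMLOCK; a quick way to check it from inside the process:)

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;

        /* RLIMIT_MEMLOCK caps how many bytes the process may pin;
         * with the never-free-memory trick, peak usage must stay
         * under this for registration to keep succeeding. */
        if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0)
            printf("memlock limit: soft %llu, hard %llu bytes\n",
                   (unsigned long long)rl.rlim_cur,
                   (unsigned long long)rl.rlim_max);
        return 0;
    }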
Steven Truelove