[ofa-general] Re: New proposal for memory management
Steven Truelove
truelove at array.ca
Thu Apr 30 05:49:30 PDT 2009
John A. Gregor wrote:
> So, how about this:
>
> Maintain a pool of pre-pinned pages.
>
> When an RTS comes in, use one of the pre-pinned buffers as the place the
> DATA will land. Set up the remaining hw context to enable receipt into
> the page(s) and fire back your CTS.
>
> While the CTS is in flight and the DATA is streaming back (and you
> therefore have a couple microseconds to play with), remap the virt-to-phys
> mapping of the application so that the original virtual address now
> points at the pre-pinned page.
A big part of the performance improvement associated with RDMA is
avoiding constant page remappings and data copies. If pinning the
physical/virtual memory mapping were cheap enough to do for each
message, MPI applications could simply pin and register the mapping
when sending/receiving each message and then unregister it when the
operation completed. MPI implementations maintain a cache of which
memory has been registered precisely because it is too expensive to
map/unmap/remap memory constantly.
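To illustrate what I mean by a registration cache, here is a minimal
sketch against the libibverbs API. The fixed-size table and get_mr()
helper are hypothetical simplifications -- real implementations use
interval trees and handle eviction and invalidation:

    /* Minimal registration cache sketch (assumes libibverbs).
     * get_mr() and the flat table are illustrative only. */
    #include <stddef.h>
    #include <infiniband/verbs.h>

    #define CACHE_SLOTS 64

    struct reg_entry {
        void          *addr;
        size_t         len;
        struct ibv_mr *mr;    /* pinned + registered region */
    };

    static struct reg_entry cache[CACHE_SLOTS];

    /* Return a registration covering [addr, addr+len), reusing a
     * cached one when possible to avoid paying the pin cost again. */
    struct ibv_mr *get_mr(struct ibv_pd *pd, void *addr, size_t len)
    {
        for (int i = 0; i < CACHE_SLOTS; i++) {
            struct reg_entry *e = &cache[i];
            if (e->mr && addr >= e->addr &&
                (char *)addr + len <= (char *)e->addr + e->len)
                return e->mr;          /* cache hit: no pinning */
        }
        /* Miss: pin and register, then remember the mapping. */
        struct ibv_mr *mr = ibv_reg_mr(pd, addr, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr)
            return NULL;
        for (int i = 0; i < CACHE_SLOTS; i++) {
            if (!cache[i].mr) {
                cache[i] = (struct reg_entry){ addr, len, mr };
                break;
            }
        }
        return mr;
    }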
Copying the parts of the page(s) not involved in the transfer would also
raise overhead quite a bit for smaller RDMAs. It is quite easy to see a
5 or 6K message requiring a 2-3K copy to fill in the rest of its pages.
And heaven help those systems with huge pages, ~1MB, in such a case.
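To make the arithmetic concrete, here is a rough sketch of the fix-up
cost under that remapping scheme. The page and message sizes are just
illustrative, and I assume the message starts page-aligned:

    #include <stdio.h>

    int main(void)
    {
        size_t page = 4096;       /* 4K pages; huge pages are far worse */
        size_t msg  = 5 * 1024;   /* a 5K message, page-aligned start */

        /* The remap swaps whole pages, so everything in the touched
         * pages that is NOT part of the message must be copied from
         * the old pages into the pre-pinned ones. */
        size_t pages  = (msg + page - 1) / page;   /* pages touched: 2 */
        size_t copied = pages * page - msg;        /* bytes to fix: 3K */

        printf("%zu-byte message -> copy %zu bytes of surrounding data\n",
               msg, copied);
        return 0;
    }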
I have seen this problem in our own MPI application. The 'simple'
solution I have seen used in at least one MPI implementation for this
problem is to prevent the malloc/free implementation in use from
ever returning memory to the OS. The virtual/physical mapping can only
become invalid if virtual addresses are given back to the OS and then
returned later backed by different physical pages. Under Linux, at
least, it is quite easy to tell libc never to return memory to the OS.
In that case free() and similar functions simply retain the memory for
use by future malloc() (and similar) calls. Because the memory is never
unpinned and never given back to the OS, the physical/virtual mapping
stays consistent forever. I don't know if other OSes make this as easy,
or even what systems most MPI implementors want their software to run on.
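With glibc this comes down to two mallopt() calls (there are also the
equivalent MALLOC_TRIM_THRESHOLD_ / MALLOC_MMAP_MAX_ environment
variables). A minimal sketch:

    #include <malloc.h>

    /* Tell glibc malloc to hold on to freed memory forever:
     *  - M_TRIM_THRESHOLD = -1 disables returning heap (sbrk)
     *    memory to the kernel on free();
     *  - M_MMAP_MAX = 0 disables mmap-backed allocations, which
     *    would otherwise be munmap()ed on free and come back
     *    later with different physical pages.
     * Call this once, before any memory gets registered. */
    void keep_heap_forever(void)
    {
        mallopt(M_TRIM_THRESHOLD, -1);
        mallopt(M_MMAP_MAX, 0);
    }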
The obvious downside to this is that a process with highly irregular
memory demand will always occupy the memory of its previous peak usage.
And because the memory is pinned, it cannot even be swapped out, and it
counts against the memory pinning ulimit. For many MPI applications
that is not a problem -- they often have quite fixed memory usage and
wouldn't be returning much, if any, memory to the OS anyway. This is
the case for our application. I imagine someone out there has some job
that doesn't behave so neatly, of course.
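(For what it's worth, the limit the kernel enforces here is
RLIMIT_MEMLOCK; a quick way to check it from inside the process:)

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;

        /* RLIMIT_MEMLOCK caps how many bytes the process may pin;
         * with the never-free-memory trick, peak usage must stay
         * under this for registration to keep succeeding. */
        if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0)
            printf("memlock limit: soft %llu, hard %llu bytes\n",
                   (unsigned long long)rl.rlim_cur,
                   (unsigned long long)rl.rlim_max);
        return 0;
    }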
Steven Truelove