[ofa-general] Re: New proposal for memory management

arkady kanevsky arkady.kanevsky at gmail.com
Thu Apr 30 06:24:51 PDT 2009


Jeff, had you considered the notion of a buffer and buffer iteration introduced
by MPI/RT (The Real-Time Message Passing Interface Standard,
in Concurrency and Computation: Practice and Experience,
Volume 16, No. S1, pp. S1-S332, Dec 2004; see Chapter 5)?
It basically sets up a contract of buffer (and underlying memory)
ownership between the MPI implementation and the user.
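
For illustration, the kind of ownership contract I have in mind looks
roughly like the C sketch below. The names are hypothetical, not the
actual MPI/RT interface (Chapter 5 of the standard has the real one);
the point is that a buffer is set up (and pinned/registered) once and
afterwards only changes hands between the user and the MPI implementation.

#include <stddef.h>
#include <stdio.h>

enum buf_owner { OWNED_BY_USER, OWNED_BY_MPI };

/* A buffer whose memory is set up once; afterwards only ownership moves. */
struct rt_buffer {
    void          *base;   /* registered/pinned up front */
    size_t         len;
    enum buf_owner owner;  /* who may touch the memory right now */
};

/* User hands the buffer to the library for one transfer ("iteration"). */
int rt_buffer_give_to_mpi(struct rt_buffer *b)
{
    if (b->owner != OWNED_BY_USER)
        return -1;          /* contract violation */
    b->owner = OWNED_BY_MPI;
    return 0;
}

/* Library hands it back when the transfer completes. */
int rt_buffer_return_to_user(struct rt_buffer *b)
{
    if (b->owner != OWNED_BY_MPI)
        return -1;
    b->owner = OWNED_BY_USER;
    return 0;
}

int main(void)
{
    char mem[4096];
    struct rt_buffer b = { mem, sizeof mem, OWNED_BY_USER };
    rt_buffer_give_to_mpi(&b);     /* start of a buffer iteration */
    rt_buffer_return_to_user(&b);  /* end of the iteration        */
    printf("owner is user again: %d\n", b.owner == OWNED_BY_USER);
    return 0;
}
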
Arkady

On Thu, Apr 30, 2009 at 8:49 AM, Steven Truelove <truelove at array.ca> wrote:

>
>
> John A. Gregor wrote:
>
>> So, how about this:
>>
>> Maintain a pool of pre-pinned pages.
>>
>> When an RTS comes in, use one of the pre-pinned buffers as the place the
>> DATA will land.  Set up the remaining hw context to enable receipt into
>> the page(s) and fire back your CTS.
>>
>> While the CTS is in flight and the DATA is streaming back (and you
>> therefore have a couple microseconds to play with), remap the virt-to-phys
>> mapping of the application so that the original virtual address now
>> points at the pre-pinned page.
>>
>
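> For concreteness, the flow described above amounts to something like the
> sketch below. Every helper in it is a hypothetical stub standing in for
> driver/verbs plumbing, not a real API; it only shows the intended ordering
> of events.
>
> #include <stddef.h>
> #include <stdio.h>
>
> /* A buffer from the pre-pinned pool. */
> struct pinned_buf { void *addr; size_t len; };
>
> /* Stub helpers -- placeholders for the real hardware/driver work. */
> static struct pinned_buf *take_from_pinned_pool(size_t len)
> {
>     static char backing[64 * 1024];     /* pretend this is pinned */
>     static struct pinned_buf b;
>     (void)len;
>     b.addr = backing;
>     b.len  = sizeof backing;
>     return &b;
> }
> static void post_cts(struct pinned_buf *b) { (void)b; }  /* send CTS */
> static void remap_user_va(void *user_va, struct pinned_buf *b)
> {
>     (void)user_va; (void)b;             /* re-point the app's VA */
> }
>
> /* On RTS: land DATA in a pre-pinned buffer, answer with CTS, and remap
>  * the application's virtual address while DATA is still in flight. */
> static void handle_rts(void *user_va, size_t len)
> {
>     struct pinned_buf *b = take_from_pinned_pool(len);
>     post_cts(b);                /* DATA starts streaming into b */
>     remap_user_va(user_va, b);  /* done in the microseconds before it lands */
> }
>
> int main(void)
> {
>     char dst[8192];
>     handle_rts(dst, sizeof dst);
>     puts("sketch only -- all helpers above are stubs");
>     return 0;
> }
>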
> A big part of the performance improvement associated with RDMA is avoiding
> constant page remappings and data copies.  If pinning and registering
> memory were cheap enough to do for each message, MPI applications could
> simply pin and register the memory as each message is sent or received and
> then unpin it when the operation completes.  MPI implementations maintain a
> cache of what memory has been registered precisely because it is too
> expensive to map/unmap/remap memory constantly.
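>
> (For reference, the cache amounts to something like the fragment below --
> a simplified sketch that assumes an already-opened protection domain and
> uses the real verbs call ibv_reg_mr(); a production cache also needs
> eviction and has to notice when cached memory is freed or unmapped.)
>
> #include <stddef.h>
> #include <infiniband/verbs.h>
>
> #define REG_CACHE_SLOTS 64
>
> struct reg_entry { void *addr; size_t len; struct ibv_mr *mr; };
> static struct reg_entry cache[REG_CACHE_SLOTS];
> static int cache_used;
>
> /* Return a registration covering [addr, addr+len), registering and
>  * caching on a miss.  Linear scan, no eviction -- sketch only. */
> struct ibv_mr *get_cached_mr(struct ibv_pd *pd, void *addr, size_t len)
> {
>     for (int i = 0; i < cache_used; i++) {
>         char *base = cache[i].addr;
>         if ((char *)addr >= base &&
>             (char *)addr + len <= base + cache[i].len)
>             return cache[i].mr;             /* hit: no pin/unpin cost */
>     }
>     struct ibv_mr *mr = ibv_reg_mr(pd, addr, len,
>                                    IBV_ACCESS_LOCAL_WRITE  |
>                                    IBV_ACCESS_REMOTE_READ  |
>                                    IBV_ACCESS_REMOTE_WRITE);
>     if (mr && cache_used < REG_CACHE_SLOTS)
>         cache[cache_used++] = (struct reg_entry){ addr, len, mr };
>     return mr;
> }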
>
> Copying parts of the page(s) not involved in the transfer would also raise
> overhead quite a bit for smaller RDMAs.  It is quite easy to see a 5 or 6K
> message requiring a 2-3K copy to fix the rest of a page.  And heaven help
> those systems with huge pages, ~1MB, in such a case.
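>
> (To put rough numbers on that: the leftover that has to be copied is just
> the part of the touched page(s) the message does not cover.  A tiny
> illustration, assuming the message lands page-aligned:)
>
> #include <stdio.h>
>
> /* Bytes of the touched page(s) NOT covered by the message, i.e. what
>  * would have to be copied to preserve the rest of the page(s). */
> static size_t leftover_bytes(size_t msg_len, size_t page_size)
> {
>     size_t pages = (msg_len + page_size - 1) / page_size;
>     return pages * page_size - msg_len;
> }
>
> int main(void)
> {
>     printf("6K msg, 4K pages: %zu bytes to copy\n",
>            leftover_bytes(6 * 1024, 4 * 1024));           /* 2048 */
>     printf("6K msg, 2M huge pages: %zu bytes to copy\n",
>            leftover_bytes(6 * 1024, 2 * 1024 * 1024));    /* ~2 MB */
>     return 0;
> }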
>
> I have seen this problem in our own MPI application.  The 'simple' solution
> I have seen used in at least one MPI implementation is to prevent the
> malloc/free implementation in use from ever returning memory to the OS.
> The virtual/physical mapping can only become invalid if virtual addresses
> are given back to the OS and later handed out again backed by different
> physical pages.  Under Linux, at least, it is quite easy to tell libc never
> to return memory to the OS.  In this case free() and similar functions
> simply retain the memory for use by future malloc() (and similar) calls.
> Because the memory is never unpinned and never given back to the OS, the
> physical/virtual mapping stays consistent forever.  I don't know if other
> OSes make this as easy, or even which systems most MPI implementors want
> their software to run on.
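>
> (Concretely, with glibc this comes down to a couple of mallopt() calls
> made early in the run -- or the equivalent MALLOC_TRIM_THRESHOLD_ /
> MALLOC_MMAP_MAX_ environment variables:)
>
> #include <malloc.h>
>
> /* Ask glibc's malloc to keep freed memory instead of handing it back to
>  * the kernel, so the virtual-to-physical mapping (and any registration)
>  * stays valid for the life of the process. */
> void keep_heap_memory(void)
> {
>     mallopt(M_TRIM_THRESHOLD, -1);  /* never trim the heap back via sbrk() */
>     mallopt(M_MMAP_MAX, 0);         /* never use mmap() for large mallocs,
>                                        since those get munmap()ed on free */
> }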
>
> The obvious downside to this is that a process with highly irregular memory
> demand will always hold the memory of its previous peak.  And because that
> memory is pinned, it will not even be swapped out, and it will count
> against the memory-pinning ulimit.  For many MPI applications that is not a
> problem -- they often have quite fixed memory usage and wouldn't be
> returning much, if any, memory to the OS anyway.  That is the case for our
> application.  I imagine someone out there has a job that doesn't behave so
> neatly, of course.
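>
> (Since everything ends up pinned, it is worth checking what locked-memory
> limit the job actually runs under -- "ulimit -l" in the shell, or from the
> program itself:)
>
> #include <stdio.h>
> #include <sys/resource.h>
>
> int main(void)
> {
>     struct rlimit rl;
>     if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0)
>         printf("locked memory limit: soft %llu, hard %llu bytes\n",
>                (unsigned long long)rl.rlim_cur,
>                (unsigned long long)rl.rlim_max);
>     return 0;
> }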
>
>
> Steven Truelove
>
>



-- 
Cheers,
Arkady Kanevsky