[ofa-general] Memory registration redux

Jeff Squyres jsquyres at cisco.com
Mon May 18 11:24:33 PDT 2009


On May 18, 2009, at 2:02 PM, Caitlin Bestler wrote:

> >>  > Specifically: the actual dereg of 0x1000-0x3fff is blocked on also
> >>  > releasing 0x2000-0x2fff.
> >>
> >> If everyone is doing this, how do you handle the case that Jason pointed
> >> out, namely:
> >>
> >>  * you register 0x1000 ... 0x3fff
> >>  * you want to register 0x2000 ... 0x2fff and have a cache hit
> >>  * you finish up with 0x1000 ... 0x3fff
> >>  * app does something (which is valid since you finished up with the
> >>   bigger range) that invalidates mapping 0x3000 ... 0x3fff (eg free()
> >>   that leads to munmap() or whatever), and your hooks tell you so.
> >>  * app reallocates a mapping in 0x3000 ... 0x3fff
> >>  * you want to re-register 0x1000 ... 0x3fff -- but it has to be marked
> >>   both invalid and in-use in the cache at this point !?
>

I think I mis-parsed the above scenario in my previous response.

When our memory hooks tell us that memory is about to be removed from  
the process, we unregister all pages in the relevant region and remove  
those entries from the cache.  So the next time you look in the cache  
for 0x3000-0x3fff, it won't be there -- it'll be treated as cache-cold.
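
A minimal sketch of that invalidation path, in Python for brevity (the real code is C).  The class and method names are hypothetical, not Open MPI's actual API; a 0x1000-byte page is assumed, and each page is modeled as its own cache entry, which is a simplification:

```python
PAGE = 0x1000  # assumed page size


class RegCache:
    """Toy registration cache keyed by page base address (hypothetical)."""

    def __init__(self):
        self.pages = {}          # page base address -> stand-in registration
        self.deregistered = []   # pages dropped by the memory hook

    def register(self, start, end):
        """Register [start, end] inclusive; return the cache-cold pages."""
        cold = []
        for page in range(start & ~(PAGE - 1), end + 1, PAGE):
            if page not in self.pages:
                self.pages[page] = object()  # stand-in for a real registration
                cold.append(page)
        return cold

    def on_memory_release(self, start, end):
        """Memory hook: [start, end] is about to leave the process."""
        for page in range(start & ~(PAGE - 1), end + 1, PAGE):
            if self.pages.pop(page, None) is not None:
                self.deregistered.append(page)


cache = RegCache()
cache.register(0x1000, 0x3fff)
cache.on_memory_release(0x3000, 0x3fff)   # e.g. free() that led to munmap()
cold = cache.register(0x1000, 0x3fff)     # only page 0x3000 is cache-cold
```

In this toy model, re-registering 0x1000-0x3fff after the hook fires finds 0x1000 and 0x2000 still cached and registers only 0x3000 afresh, so no entry is ever both invalid and in use.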

> How does 0x1000 to 0x3fff get registered as a single Memory Region?
> If it is legitimate to free() 0x3000..0x3fff then how can there ever be a
> legitimate reference to 0x1000..0x3fff? If there is no such single reference,
> I don't see how a Memory Region is ever created covering that range.
>
> If the user creates the Memory Region, then they are responsible for not
> free()ing a portion of it.
>

Agreed.  If an application does that, it deserves what it gets.

> Would the MPI library ever create a single large memory region based on
> two distinct Sends?
>


Per my prior mail, Open MPI registers chunks at a time.  Each chunk is  
potentially a multiple of pages.  So yes, you could end up having a  
single registration that spans the buffers used in multiple, distinct  
MPI sends.  We reference count by page to ensure that deregistrations  
do not occur prematurely.

For example, suppose page X contains the end of one large buffer and the  
beginning of another, both of which are being used in ongoing  
non-blocking MPI communications.  Then page X's entry in our cache will  
have a refcount == 2.  OMPI won't allow the registration containing  
that page to become eligible for deregistering until the cache entry's  
refcount goes down to 0.
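
Roughly, the per-page counting could look like the sketch below (Python for brevity; the function names, the 0x1000-byte page size, and the buffer addresses are all made up for illustration and are not OMPI's real code).  It tracks eligibility per page, whereas OMPI tracks it for the registration containing the page:

```python
from collections import Counter

PAGE = 0x1000  # assumed page size

refcount = Counter()  # page base address -> number of in-flight users


def pages_of(start, length):
    """Page base addresses touched by the buffer [start, start + length)."""
    first = start & ~(PAGE - 1)
    last = (start + length - 1) & ~(PAGE - 1)
    return range(first, last + PAGE, PAGE)


def acquire(start, length):
    """Take one reference on every page the buffer touches."""
    for page in pages_of(start, length):
        refcount[page] += 1


def release(start, length):
    """Drop one reference per page; return pages whose count reached 0."""
    eligible = []
    for page in pages_of(start, length):
        refcount[page] -= 1
        if refcount[page] == 0:
            del refcount[page]
            eligible.append(page)
    return eligible


# Buffer A ends in page 0x3000 and buffer B begins in it.
acquire(0x1800, 0x2000)                  # A touches pages 0x1000, 0x2000, 0x3000
acquire(0x3800, 0x1000)                  # B touches pages 0x3000, 0x4000
shared = refcount[0x3000]                # both sends hold the shared page
freed_after_a = release(0x1800, 0x2000)  # 0x3000 is still held by B
freed_after_b = release(0x3800, 0x1000)  # now 0x3000 can go as well
```

Here the shared page only shows up as eligible once the second send has completed.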

See my prior mail for a more complex example of our cache's behavior.

-- 
Jeff Squyres
Cisco Systems



