[ofa-general] Memory registration redux

Jeff Squyres jsquyres at cisco.com
Thu May 7 06:54:26 PDT 2009


On May 6, 2009, at 4:10 PM, Roland Dreier (rdreier) wrote:

> By the way, what's the desired behavior of the cache if a process
> registers, say, address range 0x1000 ... 0x3fff, and then the same
> process registers address range 0x2000 ... 0x2fff (with all the same
> permissions, etc)?
>
> The initial registration creates an MR that is still valid for the
> smaller virtual address range, so the second registration is much
> cheaper if we used the cached registration; but if we use the cache for
> the second registration, and then deregister the first one, we're stuck
> with a too-big range pinned in the cache because of the second
> registration.
>


I don't know what the other MPIs do in this scenario, but here's what
OMPI will do:

1. lookup 0x1000-0x3fff in the cache; not find any of it, and
therefore register
    - add each page to our cache with a refcount of 1
2. lookup 0x2000-0x2fff in the cache, find that all the pages are  
already registered
    - refcount++ on each page in the cache
3. when we go to dereg 0x1000-0x3fff
    - refcount-- on each page in the cache
    - since some pages in the range still have refcount>0, don't do  
anything further

Specifically: the actual dereg of 0x1000-0x3fff is blocked on also  
releasing 0x2000-0x2fff.
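
To make the refcount bookkeeping above concrete, here is a rough Python
sketch. It is not OMPI's actual code: the class and method names, the
4 KiB page size, and the way pinned regions are tracked are all
assumptions made for illustration.

```python
PAGE_SIZE = 0x1000  # assumption: 4 KiB pages


def pages(start, end):
    """Page numbers touched by the inclusive byte range [start, end]."""
    return range(start // PAGE_SIZE, end // PAGE_SIZE + 1)


class PageRefCache:
    """Hypothetical per-page refcount cache (illustration only)."""

    def __init__(self):
        self.refcount = {}  # page number -> refcount (0 means pinned but unused)
        self.regions = []   # (start, end) ranges that were really registered

    def register(self, start, end):
        if all(p in self.refcount for p in pages(start, end)):
            result = "hit"   # every page is already registered
        else:
            result = "miss"  # a real registration would happen here
            self.regions.append((start, end))
            for p in pages(start, end):
                self.refcount.setdefault(p, 0)
        for p in pages(start, end):
            self.refcount[p] += 1
        return result

    def deregister(self, start, end):
        """Drop one reference per page; return regions that are now idle."""
        for p in pages(start, end):
            self.refcount[p] -= 1
        idle = [r for r in self.regions
                if all(self.refcount[p] == 0 for p in pages(*r))]
        for r in idle:
            self.regions.remove(r)
            for p in pages(*r):
                # forget a page only if no other pinned region still covers it
                if not any(p in pages(*o) for o in self.regions):
                    del self.refcount[p]
        return idle


cache = PageRefCache()
first = cache.register(0x1000, 0x3fff)            # step 1: miss, so register
second = cache.register(0x2000, 0x2fff)           # step 2: hit, refcount++
after_first = cache.deregister(0x1000, 0x3fff)    # step 3: blocked
after_second = cache.deregister(0x2000, 0x2fff)   # now the big range is idle
```

In this sketch the first deregister returns an empty list because page
0x2000 still has a refcount of 1; only the second deregister reports
0x1000-0x3fff as idle.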

Note that OMPI will only register a max of X bytes at a time (where X  
defaults to 2MB).  So even if a user calls MPI_SEND(...) with an  
enormous buffer, we'll register it X/page_size pages at a time, not  
the entire buffer at once.  Hence, the "buffer A is blocked from
dereg'ing by buffer B" scenario is *somewhat* mitigated -- it's less
wasteful than if we had registered/cached the entire huge buffer at
once.
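
As a rough illustration of the chunking, the Python sketch below splits
a buffer into pieces of at most X bytes.  The function name is made up,
and it ignores page alignment, which the real code may well handle
differently.

```python
def registration_chunks(addr, length, max_bytes=2 * 1024 * 1024):
    """Yield (address, size) pieces of at most max_bytes each.

    max_bytes plays the role of X (2MB by default); the function itself
    is a hypothetical helper, not an OMPI API.
    """
    end = addr + length
    while addr < end:
        size = min(max_bytes, end - addr)
        yield addr, size
        addr += size


MB = 1024 * 1024
chunks = list(registration_chunks(0, 5 * MB))
# a 5MB buffer becomes two 2MB pieces and one 1MB piece
```

With 4 KiB pages (an assumption), each full 2MB piece covers 512 pages,
so a buffer blocked from dereg'ing pins at most that much extra memory
per cached piece.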

Finally, note that if 0x2000-0x2fff had not been registered, the
0x1000-0x3fff pages are not actually deregistered when all the pages'
refcounts go to 0 -- they are just moved to the "able to be dereg'ed"
list.  We don't actually dereg them until we later try to reg new
memory and fail due to lack of resources.  Then we take entries off
the "able to be dereg'ed" list and dereg them, then try reg'ing the
new memory again.
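
The lazy dereg behavior can be sketched in Python roughly as follows.
Everything here is an assumption for illustration: the class name, the
max_pinned counter standing in for "lack of resources", releasing the
oldest entries first, and releasing one entry per retry.

```python
from collections import deque


class LazyDeregPool:
    """Hypothetical model of the "able to be dereg'ed" list."""

    def __init__(self, max_pinned):
        self.max_pinned = max_pinned  # stand-in for the real resource limit
        self.active = set()           # registrations currently in use
        self.idle = deque()           # unused but still registered

    def _try_reg(self, region):
        # stand-in for the real registration call, which may fail
        if len(self.active) + len(self.idle) >= self.max_pinned:
            return False
        self.active.add(region)
        return True

    def register(self, region):
        """Register region; dereg idle entries only if registration fails."""
        evicted = []
        while not self._try_reg(region):
            if not self.idle:
                raise MemoryError("no idle registrations left to release")
            evicted.append(self.idle.popleft())  # the actual dereg
        return evicted

    def release(self, region):
        """All refcounts hit 0: keep it registered, just mark it idle."""
        self.active.remove(region)
        self.idle.append(region)


pool = LazyDeregPool(max_pinned=2)
pool.register("A")
pool.register("B")
pool.release("A")             # A stays registered, sitting on the idle list
evicted = pool.register("C")  # fails once, deregs A, then succeeds
```

Here registering "C" only succeeds after "A" is taken off the idle list,
which is the fail, dereg, retry sequence described above.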

MVAPICH: do you guys do similar things?

(I don't know if HP/Scali/Intel will comment on their registration  
cache schemes)

-- 
Jeff Squyres
Cisco Systems



