[ofa-general] Memory registration redux
Jeff Squyres
jsquyres at cisco.com
Thu May 7 06:54:26 PDT 2009
On May 6, 2009, at 4:10 PM, Roland Dreier (rdreier) wrote:
> By the way, what's the desired behavior of the cache if a process
> registers, say, address range 0x1000 ... 0x3fff, and then the same
> process registers address range 0x2000 ... 0x2fff (with all the same
> permissions, etc)?
>
> The initial registration creates an MR that is still valid for the
> smaller virtual address range, so the second registration is much
> cheaper if we used the cached registration; but if we use the cache
> for the second registration, and then deregister the first one, we're
> stuck with a too-big range pinned in the cache because of the second
> registration.
>
I don't know what the other MPIs do in this scenario, but here's what
OMPI will do:

1. Look up 0x1000-0x3fff in the cache; find none of it, and
   therefore register it
   - add each page to our cache with a refcount of 1
2. Look up 0x2000-0x2fff in the cache; find that all the pages are
   already registered
   - refcount++ on each page in the cache
3. When we go to dereg 0x1000-0x3fff
   - refcount-- on each page in the cache
   - since some pages in the range still have refcount>0, don't do
     anything further

Specifically: the actual dereg of 0x1000-0x3fff is blocked on also
releasing 0x2000-0x2fff.
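
The three steps above can be sketched in a few lines of Python. This is
only an illustration of the per-page refcounting idea, not OMPI's actual
code: the class name, the 4 KiB page size, and the real_regs counter
(a stand-in for the expensive registration call) are all assumptions
made for the example.

```python
PAGE_SIZE = 0x1000  # assumption: 4 KiB pages


def pages(start, end):
    """Base addresses of the pages covering the inclusive range [start, end]."""
    first = start - (start % PAGE_SIZE)
    return range(first, end + 1, PAGE_SIZE)


class PageRefcountCache:
    """Hypothetical per-page refcount cache; not OMPI's real data structure."""

    def __init__(self):
        self.refcount = {}  # page base address -> refcount
        self.real_regs = 0  # how many times a real registration was needed

    def register(self, start, end):
        wanted = pages(start, end)
        if any(p not in self.refcount for p in wanted):
            # Stand-in for the expensive registration with the adapter.
            self.real_regs += 1
        for p in wanted:
            self.refcount[p] = self.refcount.get(p, 0) + 1

    def deregister(self, start, end):
        """Drop one reference per page; return True if the range is now idle."""
        wanted = pages(start, end)
        for p in wanted:
            self.refcount[p] -= 1
        return all(self.refcount[p] == 0 for p in wanted)


cache = PageRefcountCache()
cache.register(0x1000, 0x3fff)  # step 1: nothing cached, so really register
cache.register(0x2000, 0x2fff)  # step 2: cache hit, refcount++ only
blocked = not cache.deregister(0x1000, 0x3fff)  # step 3: page 0x2000 still held
```

After step 3, the page at 0x2000 still has a refcount of 1, so the
sketch reports the larger range as blocked until 0x2000-0x2fff is also
released.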
Note that OMPI will only register a max of X bytes at a time (where X
defaults to 2MB). So even if a user calls MPI_SEND(...) with an
enormous buffer, we'll register it X/page_size pages at a time, not
the entire buffer at once. Hence, the "buffer A is blocked from
dereg'ing by buffer B" scenario is *somewhat* mitigated -- it's less
wasteful than if we had registered/cached the entire huge buffer at
once.
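
As a rough illustration of that chunking (again a sketch, not OMPI
code), a buffer can be split into pieces of at most X bytes before each
piece is registered. The function name and the (address, size) tuple
layout are made up for the example; only the 2MB default comes from the
description above.

```python
def registration_chunks(addr, length, max_bytes=2 * 1024 * 1024):
    """Split [addr, addr + length) into pieces of at most max_bytes each."""
    chunks = []
    offset = 0
    while offset < length:
        size = min(max_bytes, length - offset)
        chunks.append((addr + offset, size))
        offset += size
    return chunks


# A 5 MiB buffer is registered as 2 MiB + 2 MiB + 1 MiB, never all at once.
MIB = 1024 * 1024
pieces = registration_chunks(0x10000000, 5 * MIB)
```

Each piece can then be cached and released on its own, so one
overlapping registration pins at most the chunks it touches rather than
the whole buffer.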
Finally, note that even if 0x2000-0x2fff had not been registered, the
0x1000-0x3fff pages would not actually be deregistered when all the
pages' refcounts go to 0 -- they are just moved to the "able to be
dereg'ed" list. We don't actually dereg them until we later try to reg
new memory and fail due to lack of resources. Then we take entries off
the "able to be dereg'ed" list and dereg them, then try reg'ing the
new memory again.
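
A minimal sketch of that lazy scheme follows. Several things here are
assumptions for the sake of a runnable example: the capacity limit
stands in for "lack of resources", entries are evicted oldest-first and
one at a time before each retry, and cache hits on entries sitting in
the list are left out entirely. None of the names come from OMPI.

```python
from collections import OrderedDict


class LazyDeregCache:
    """Hypothetical model of deferred deregistration with evict-and-retry."""

    def __init__(self, capacity):
        self.capacity = capacity   # assumed limit on pinned regions
        self.in_use = set()        # regions with refcount > 0
        self.idle = OrderedDict()  # the "able to be dereg'ed" list, oldest first
        self.deregistered = []     # regions that were really deregistered

    def _reg(self, region):
        # Idle regions are still pinned, so they count against the limit.
        if len(self.in_use) + len(self.idle) >= self.capacity:
            raise MemoryError("out of registration resources")
        self.in_use.add(region)

    def register(self, region):
        while True:
            try:
                self._reg(region)
                return
            except MemoryError:
                if not self.idle:
                    raise  # nothing left to evict; give up
                victim, _ = self.idle.popitem(last=False)
                self.deregistered.append(victim)  # the real dereg happens here

    def release(self, region):
        # Refcount reached 0: keep the region pinned, just mark it evictable.
        self.in_use.remove(region)
        self.idle[region] = None


lazy = LazyDeregCache(capacity=2)
lazy.register("A")
lazy.register("B")
lazy.release("A")   # A moves to the idle list; no dereg yet
lazy.register("C")  # first attempt fails, A is dereg'ed, retry succeeds
```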
MVAPICH: do you guys do similar things?
(I don't know if HP/Scali/Intel will comment on their registration
cache schemes)
--
Jeff Squyres
Cisco Systems