[ofa-general] Memory registration redux
Tang, Changqing
changquing.tang at hp.com
Thu May 7 09:07:05 PDT 2009
HP-MPI does pretty much the same thing. --CQ
> -----Original Message-----
> From: general-bounces at lists.openfabrics.org
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of
> Jeff Squyres
> Sent: Thursday, May 07, 2009 8:54 AM
> To: Roland Dreier
> Cc: Pavel Shamis; Hans Westgaard Ry; Terry Dontje; Lenny
> Verkhovsky; Håkon Bugge; Donald Kerr; OpenFabrics General;
> Alexander Supalov
> Subject: Re: [ofa-general] Memory registration redux
>
> On May 6, 2009, at 4:10 PM, Roland Dreier (rdreier) wrote:
>
> > By the way, what's the desired behavior of the cache if a process
> > registers, say, address range 0x1000 ... 0x3fff, and then the same
> > process registers address range 0x2000 ... 0x2fff (with all the same
> > permissions, etc)?
> >
> > The initial registration creates an MR that is still valid for the
> > smaller virtual address range, so the second registration is much
> > cheaper if we used the cached registration; but if we use the cache
> > for the second registration, and then deregister the first one, we're
> > stuck with a too-big range pinned in the cache because of the second
> > registration.
> >
>
>
> I don't know what the other MPIs do in this scenario, but
> here's what OMPI will do:
>
> 1. lookup 0x1000-0x3fff in the cache; not find any of it,
>    and therefore register
>    - add each page to our cache with a refcount of 1
> 2. lookup 0x2000-0x2fff in the cache, find that all the pages
>    are already registered
>    - refcount++ on each page in the cache
> 3. when we go to dereg 0x1000-0x3fff
>    - refcount-- on each page in the cache
>    - since some pages in the range still have refcount>0,
>      don't do anything further
>
> Specifically: the actual dereg of 0x1000-0x3fff is blocked on
> also releasing 0x2000-0x2fff.
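The three steps above can be sketched as a per-page refcount table. This is only an illustration of the scheme as described, not OMPI's actual code: the class name `RegCache`, the 4 KB page size, and the `pinned` list are assumptions, real registration calls are replaced by bookkeeping, partially overlapping ranges are not handled carefully, and a range is released as soon as all of its pages reach zero (the deferred release described later in the message is left out).

```python
PAGE_SIZE = 0x1000  # assumed page size (4 KB)


class RegCache:
    """Toy per-page refcount cache; stands in for real MR registration."""

    def __init__(self):
        self.refcount = {}  # page number -> refcount
        self.pinned = []    # (first_page, last_page) ranges actually registered

    @staticmethod
    def pages(start, end):
        # 'end' is inclusive, matching the 0x1000-0x3fff notation in the text
        return range(start // PAGE_SIZE, end // PAGE_SIZE + 1)

    def register(self, start, end):
        """Return True if a real registration was needed (cache miss)."""
        pages = self.pages(start, end)
        hit = all(p in self.refcount for p in pages)
        if not hit:
            # Cache miss: this is where the real registration would happen.
            self.pinned.append((pages[0], pages[-1]))
        for p in pages:
            self.refcount[p] = self.refcount.get(p, 0) + 1
        return not hit

    def deregister(self, start, end):
        """Decrement refcounts; return the pinned ranges that became free."""
        for p in self.pages(start, end):
            self.refcount[p] -= 1
        released = [
            r for r in self.pinned
            if all(self.refcount.get(p, 0) == 0 for p in range(r[0], r[1] + 1))
        ]
        for first, last in released:
            self.pinned.remove((first, last))
            for p in range(first, last + 1):
                self.refcount.pop(p, None)
        return released


cache = RegCache()
cache.register(0x1000, 0x3fff)      # miss: pages 1..3 get refcount 1
cache.register(0x2000, 0x2fff)      # hit: page 2 goes to refcount 2
cache.deregister(0x1000, 0x3fff)    # page 2 still in use, nothing released
cache.deregister(0x2000, 0x2fff)    # now the whole 0x1000-0x3fff range is free
```

The third call returns an empty list, which is the "blocked" case: the larger range stays pinned until the smaller one is released as well.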
>
> Note that OMPI will only register a max of X bytes at a time
> (where X defaults to 2MB). So even if a user calls
> MPI_SEND(...) with an enormous buffer, we'll register it
> X/page_size pages at a time, not the entire buffer at once.
> Hence, the "buffer A is blocked from dereg'ing by buffer B"
> scenario is *somewhat* mitigated -- it's less wasteful than
> if we had registered/cached the entire huge buffer at once.
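The chunking described above might look roughly like the following. It is a sketch under assumptions: the helper name `registration_chunks` is hypothetical, the 2 MB default is the X mentioned in the text, and page alignment of the buffer is ignored for brevity.

```python
def registration_chunks(addr, length, max_bytes=2 * 1024 * 1024):
    """Yield (start_address, byte_count) pieces no larger than max_bytes."""
    end = addr + length
    while addr < end:
        n = min(max_bytes, end - addr)
        yield (addr, n)
        addr += n


# A 5 MB buffer is split into 2 MB + 2 MB + 1 MB pieces, so each piece
# can be cached and released on its own.
pieces = list(registration_chunks(0x10000, 5 * 1024 * 1024))
```

With a 4 KB page size, each full piece covers 512 pages, so an overlapping registration can hold back at most the pieces it touches rather than the whole buffer.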
>
> Finally, note that if 0x2000-0x2fff had not been registered,
> the 0x1000-0x3fff pages are not actually deregistered when
> all the pages' refcounts go to 0 -- they are just moved to the
> "able to be dereg'ed list". We don't actually dereg it until we
> later try to reg new memory and fail due to lack of resources.
> Then we take entries off the "able to be dereg'ed list" and
> dereg them, then try reg'ing the new memory again.
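The deferred deregistration described above can be sketched as follows. Everything here is illustrative: `LazyDeregCache` and `max_regs` are hypothetical names, a simple count limit stands in for "fail due to lack of resources", and reusing an entry straight from the idle list is an assumption about how such a cache would normally behave rather than something stated in the message.

```python
from collections import OrderedDict


class LazyDeregCache:
    """Toy model: unused registrations stay pinned until resources run out."""

    def __init__(self, max_regs):
        self.max_regs = max_regs   # hypothetical stand-in for a pinning limit
        self.active = set()        # registrations currently in use
        self.idle = OrderedDict()  # the "able to be dereg'ed" list, oldest first
        self.dereg_calls = []      # log of real deregistrations, for inspection

    def _try_register(self):
        # Simulated registration: fails when the limit is reached.
        return len(self.active) + len(self.idle) < self.max_regs

    def register(self, key):
        if key in self.idle:
            # Still pinned: reuse it without registering again.
            del self.idle[key]
            self.active.add(key)
            return
        while not self._try_register():
            if not self.idle:
                raise MemoryError("out of registration resources")
            # Registration failed: dereg the oldest idle entry and retry.
            victim, _ = self.idle.popitem(last=False)
            self.dereg_calls.append(victim)
        self.active.add(key)

    def release(self, key):
        # Refcount reached zero: park the entry instead of deregistering it.
        self.active.remove(key)
        self.idle[key] = True


cache = LazyDeregCache(max_regs=2)
cache.register("A")
cache.register("B")
cache.release("A")    # "A" is parked, not deregistered
cache.register("C")   # limit hit: "A" is deregistered, then "C" succeeds
```

The point of the design is that releasing a buffer costs nothing up front; the deregistration cost is paid only when a new registration actually needs the resources.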
>
> MVAPICH: do you guys do similar things?
>
> (I don't know if HP/Scali/Intel will comment on their
> registration cache schemes)
>
> --
> Jeff Squyres
> Cisco Systems
>