[ofa-general] Memory registration redux

Tang, Changqing changquing.tang at hp.com
Thu May 7 09:07:05 PDT 2009


HP-MPI does pretty much the same thing.  --CQ
 

> -----Original Message-----
> From: general-bounces at lists.openfabrics.org 
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of 
> Jeff Squyres
> Sent: Thursday, May 07, 2009 8:54 AM
> To: Roland Dreier
> Cc: Pavel Shamis; Hans Westgaard Ry; Terry Dontje; Lenny 
> Verkhovsky; Håkon Bugge; Donald Kerr; OpenFabrics General; 
> Alexander Supalov
> Subject: Re: [ofa-general] Memory registration redux
> 
> On May 6, 2009, at 4:10 PM, Roland Dreier (rdreier) wrote:
> 
> > By the way, what's the desired behavior of the cache if a process
> > registers, say, address range 0x1000 ... 0x3fff, and then the same
> > process registers address range 0x2000 ... 0x2fff (with all the same
> > permissions, etc)?
> >
> > The initial registration creates an MR that is still valid for the
> > smaller virtual address range, so the second registration is much
> > cheaper if we used the cached registration; but if we use the cache
> > for the second registration, and then deregister the first one, we're
> > stuck with a too-big range pinned in the cache because of the second
> > registration.
> >
> 
> 
> I don't know what the other MPIs do in this scenario, but
> here's what OMPI will do:
>
> 1. look up 0x1000-0x3fff in the cache; find none of it,
>    and therefore register
>     - add each page to our cache with a refcount of 1
> 2. look up 0x2000-0x2fff in the cache; find that all the pages
>    are already registered
>     - refcount++ on each page in the cache
> 3. when we go to dereg 0x1000-0x3fff
>     - refcount-- on each page in the cache
>     - since some pages in the range still have refcount>0,
>       don't do anything further
> 
> Specifically: the actual dereg of 0x1000-0x3fff is blocked on 
> also releasing 0x2000-0x2fff.
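
The three steps above can be sketched as a per-page refcount table. This is
only a toy model, not Open MPI's actual data structure: the class name
PageRefCache, the 4 KiB PAGE_SIZE, and the choice to treat a partial hit as a
miss are all assumptions made for illustration.

```python
PAGE_SIZE = 0x1000  # assumption: 4 KiB pages


class PageRefCache:
    """Toy per-page refcount table (hypothetical, for illustration only)."""

    def __init__(self):
        self.refcount = {}  # page number -> number of live registrations

    def _pages(self, start, end):
        # Map an inclusive byte range to the page numbers it touches.
        return range(start // PAGE_SIZE, end // PAGE_SIZE + 1)

    def register(self, start, end):
        """Bump refcounts; return True if a real registration was needed."""
        pages = self._pages(start, end)
        # Assumption: any page missing from the cache counts as a miss.
        miss = any(p not in self.refcount for p in pages)
        for p in pages:
            self.refcount[p] = self.refcount.get(p, 0) + 1
        return miss

    def deregister(self, start, end):
        """Drop refcounts; return True if every page in the range hit 0."""
        pages = self._pages(start, end)
        for p in pages:
            self.refcount[p] -= 1
        return all(self.refcount[p] == 0 for p in pages)


cache = PageRefCache()
first_needed_reg = cache.register(0x1000, 0x3FFF)    # step 1: cache miss
second_needed_reg = cache.register(0x2000, 0x2FFF)   # step 2: cache hit
first_released = cache.deregister(0x1000, 0x3FFF)    # step 3: still blocked
```

In this model the page at 0x2000 still has a refcount of 1 after step 3, which
is why the dereg of the larger range cannot proceed until the smaller
registration is released as well.
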
> 
> Note that OMPI will only register a max of X bytes at a time 
> (where X defaults to 2MB).  So even if a user calls 
> MPI_SEND(...) with an enormous buffer, we'll register it 
> X/page_size pages at a time, not the entire buffer at once.  
> Hence, the "buffer A is blocked from dereg'ing by buffer B" 
> scenario is *somewhat* mitigated -- it's less wasteful than 
> if we had registered/cached the entire huge buffer at once.
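
A minimal sketch of the "at most X bytes at a time" splitting, assuming X is
2 MiB and that chunks are simply cut from the start of the buffer. The
function name registration_chunks and its parameters are hypothetical and are
not taken from the OMPI source.

```python
def registration_chunks(addr, length, max_bytes=2 * 1024 * 1024):
    """Yield (start, size) pieces of a buffer, each at most max_bytes long."""
    offset = 0
    while offset < length:
        size = min(max_bytes, length - offset)
        yield (addr + offset, size)
        offset += size


# A 5 MiB buffer at a made-up address is split into 2 MiB + 2 MiB + 1 MiB.
chunks = list(registration_chunks(0x10000000, 5 * 1024 * 1024))
```

With 4 KiB pages (also an assumption), each full chunk covers 512 pages, so a
lingering small registration can pin at most one chunk's worth of pages rather
than the whole buffer.
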
> 
> Finally, note that even if 0x2000-0x2fff had not been registered,
> the 0x1000-0x3fff pages would not actually be deregistered when
> all the pages' refcounts go to 0 -- they are just moved to the
> "able to be dereg'ed" list.  We don't actually dereg them until we
> later try to reg new memory and fail due to lack of resources.
> Then we take entries off the "able to be dereg'ed" list and
> dereg them, then try reg'ing the new memory again.
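
The lazy dereg behavior described above might look roughly like the sketch
below. Everything here is an assumption for illustration: the class name
LazyDeregCache, the fixed capacity standing in for "lack of resources", the
oldest-first eviction order, and the reuse of idle entries on a repeat
registration. Refcounts are left out to keep the example short.

```python
from collections import OrderedDict


class LazyDeregCache:
    """Toy model: released entries stay registered until resources run out."""

    def __init__(self, capacity):
        self.capacity = capacity    # stand-in for the real resource limit
        self.in_use = set()         # registrations currently referenced
        self.idle = OrderedDict()   # "able to be dereg'ed" list, oldest first

    def _try_register(self, key):
        # Idle entries are still registered, so they count against the limit.
        if len(self.in_use) + len(self.idle) >= self.capacity:
            return False
        self.in_use.add(key)
        return True

    def register(self, key):
        if key in self.idle:        # still registered: reuse it for free
            del self.idle[key]
            self.in_use.add(key)
            return True
        while not self._try_register(key):
            if not self.idle:
                return False        # nothing left to give back
            self.idle.popitem(last=False)  # really dereg the oldest idle entry
        return True

    def release(self, key):
        # Refcount reached 0: keep the registration, just mark it evictable.
        self.in_use.remove(key)
        self.idle[key] = True


cache = LazyDeregCache(capacity=2)
cache.register("A")
cache.register("B")
cache.release("A")              # "A" stays registered, but is now evictable
ok = cache.register("C")        # first attempt fails, "A" is dereg'ed, retry works
```

The point of the design is that a release costs nothing up front, and the
price of the real dereg is only paid when a later registration actually needs
the resources.
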
> 
> MVAPICH: do you guys do similar things?
> 
> (I don't know if HP/Scali/Intel will comment on their 
> registration cache schemes)
> 
> --
> Jeff Squyres
> Cisco Systems