[ofa-general] Memory registration redux

Matthew Koop koop at cse.ohio-state.edu
Thu May 7 09:55:13 PDT 2009


MVAPICH does pretty much the same thing.

Matt

On Thu, 7 May 2009, Tang, Changqing wrote:

>
> HP-MPI does pretty much the same thing.  --CQ
>
>
> > -----Original Message-----
> > From: general-bounces at lists.openfabrics.org
> > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of
> > Jeff Squyres
> > Sent: Thursday, May 07, 2009 8:54 AM
> > To: Roland Dreier
> > Cc: Pavel Shamis; Hans Westgaard Ry; Terry Dontje; Lenny
> > Verkhovsky; Håkon Bugge; Donald Kerr; OpenFabrics General;
> > Alexander Supalov
> > Subject: Re: [ofa-general] Memory registration redux
> >
> > On May 6, 2009, at 4:10 PM, Roland Dreier (rdreier) wrote:
> >
> > > By the way, what's the desired behavior of the cache if a process
> > > registers, say, address range 0x1000 ... 0x3fff, and then the same
> > > process registers address range 0x2000 ... 0x2fff (with all the same
> > > permissions, etc)?
> > >
> > > The initial registration creates an MR that is still valid for the
> > > smaller virtual address range, so the second registration is much
> > > cheaper if we used the cached registration; but if we use the cache
> > > for the second registration, and then deregister the first one, we're
> > > stuck with a too-big range pinned in the cache because of the second
> > > registration.
> > >
> >
> >
> > I don't know what the other MPIs do in this scenario, but
> > here's what OMPI will do:
> >
> > 1. lookup 0x1000-0x3fff in the cache; find none of it,
> >    and therefore register
> >     - add each page to our cache with a refcount of 1
> > 2. lookup 0x2000-0x2fff in the cache, find that all the pages
> >    are already registered
> >     - refcount++ on each page in the cache
> > 3. when we go to dereg 0x1000-0x3fff
> >     - refcount-- on each page in the cache
> >     - since some pages in the range still have refcount>0,
> >       don't do anything further
> >
> > Specifically: the actual dereg of 0x1000-0x3fff is blocked on
> > also releasing 0x2000-0x2fff.
> >
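The three steps above can be sketched roughly as follows. This is only an illustrative model, not OMPI's actual code: the class name PageRefCache, its methods, and the 4 KiB page size are all assumptions, and "registering" here just records a range instead of calling into the verbs library.

```python
PAGE_SIZE = 0x1000  # assumed 4 KiB pages


class PageRefCache:
    """Hypothetical per-page refcount cache (illustration only)."""

    def __init__(self):
        self.refcount = {}  # page number -> reference count
        self.regions = []   # (first_page, last_page) of simulated registrations

    def _pages(self, start, end):
        # 'end' is the last byte of the range (inclusive), as in 0x1000-0x3fff.
        return range(start // PAGE_SIZE, end // PAGE_SIZE + 1)

    def register(self, start, end):
        """Return True on a cache hit, False if a new registration was needed."""
        pages = self._pages(start, end)
        hit = all(p in self.refcount for p in pages)
        if not hit:
            # Stand-in for the real (expensive) registration call.
            self.regions.append((pages[0], pages[-1]))
        for p in pages:
            self.refcount[p] = self.refcount.get(p, 0) + 1
        return hit

    def deregister(self, start, end):
        """Drop one reference per page; return the regions that are now idle."""
        for p in self._pages(start, end):
            self.refcount[p] -= 1
        return [r for r in self.regions
                if all(self.refcount[p] == 0 for p in range(r[0], r[1] + 1))]


cache = PageRefCache()
print(cache.register(0x1000, 0x3fff))    # miss: pages 1..3 get refcount 1
print(cache.register(0x2000, 0x2fff))    # hit: page 2 goes to refcount 2
print(cache.deregister(0x1000, 0x3fff))  # page 2 still referenced: nothing idle
print(cache.deregister(0x2000, 0x2fff))  # now the whole region is idle
```

In this model the big region only becomes releasable once the smaller, overlapping user has also let go, which is the "blocked" behavior described above.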
> > Note that OMPI will only register a max of X bytes at a time
> > (where X defaults to 2MB).  So even if a user calls
> > MPI_SEND(...) with an enormous buffer, we'll register it
> > X/page_size pages at a time, not the entire buffer at once.
> > Hence, the "buffer A is blocked from dereg'ing by buffer B"
> > scenario is *somewhat* mitigated -- it's less wasteful than
> > if we had registered/cached the entire huge buffer at once.
> >
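As a rough illustration of the chunking just described, a large buffer could be split like this. The function name registration_chunks is hypothetical, the 2 MB default is taken from the text above, and page alignment of the chunk boundaries is ignored for simplicity.

```python
MAX_REG_BYTES = 2 * 1024 * 1024  # the "X" above; 2 MB default per the text


def registration_chunks(addr, length, max_bytes=MAX_REG_BYTES):
    """Yield (start_address, byte_count) pieces of at most max_bytes each."""
    offset = 0
    while offset < length:
        n = min(max_bytes, length - offset)
        yield (addr + offset, n)
        offset += n


# A 5 MB buffer is handled as three separate registrations.
for start, nbytes in registration_chunks(0x10000000, 5 * 1024 * 1024):
    print(hex(start), nbytes)
```

Each piece would then be looked up in (or added to) the cache on its own, so a lingering reference pins at most one chunk rather than the whole buffer.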
> > Finally, note that if 0x2000-0x2fff had not been registered,
> > the 0x1000-0x3fff pages would not actually be deregistered when
> > all the pages' refcounts go to 0 -- they are just moved to the
> > "able to be dereg'ed" list.  We don't actually dereg them until
> > we later try to reg new memory and fail due to lack of resources.
> > Then we take entries off the "able to be dereg'ed" list and
> > dereg them, then try reg'ing the new memory again.
> >
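The deferred deregistration described above might look something like the sketch below. Everything here is an assumption made for illustration: the class LazyDeregCache, the max_regions limit standing in for "lack of resources", and the oldest-first eviction order (the text does not say which entries are taken off the list first).

```python
from collections import OrderedDict


class LazyDeregCache:
    """Hypothetical cache that defers deregistration until resources run out."""

    def __init__(self, max_regions):
        self.max_regions = max_regions  # stand-in for the real resource limit
        self.active = {}                # region -> refcount (> 0)
        self.idle = OrderedDict()       # the "able to be dereg'ed" list
        self.deregistered = []          # log of regions actually deregistered

    def _hw_register(self, region):
        # Simulated registration: fails once too many regions are pinned.
        if len(self.active) + len(self.idle) >= self.max_regions:
            raise MemoryError("out of registration resources")

    def acquire(self, region):
        if region in self.active:
            self.active[region] += 1
            return "hit"
        if region in self.idle:
            # Still registered, so reuse it without touching the hardware.
            del self.idle[region]
            self.active[region] = 1
            return "revived"
        while True:
            try:
                self._hw_register(region)
                break
            except MemoryError:
                if not self.idle:
                    raise  # nothing left to evict
                victim, _ = self.idle.popitem(last=False)  # oldest idle entry
                self.deregistered.append(victim)           # the real dereg
        self.active[region] = 1
        return "registered"

    def release(self, region):
        self.active[region] -= 1
        if self.active[region] == 0:
            del self.active[region]
            self.idle[region] = None  # keep it pinned for possible reuse


cache = LazyDeregCache(max_regions=2)
cache.acquire("A")
cache.acquire("B")
cache.release("A")         # A becomes idle but stays registered
print(cache.deregistered)  # nothing deregistered yet
print(cache.acquire("C"))  # registration fails, A is evicted, retry succeeds
print(cache.deregistered)
```

The point of the design is that releasing a buffer is cheap and a later re-use of the same buffer is a cache hit; the cost of deregistration is only paid when a new registration would otherwise fail.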
> > MVAPICH: do you guys do similar things?
> >
> > (I don't know if HP/Scali/Intel will comment on their
> > registration cache schemes)
> >
> > --
> > Jeff Squyres
> > Cisco Systems
> >
> > _______________________________________________
> > general mailing list
> > general at lists.openfabrics.org
> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >
> > To unsubscribe, please visit
> > http://openib.org/mailman/listinfo/openib-general
