[ofa-general] Re: [patch] my mmu notifiers

Andrea Arcangeli andrea at qumranet.com
Tue Feb 19 16:52:06 PST 2008


On Wed, Feb 20, 2008 at 12:04:27AM +0100, Nick Piggin wrote:
> On Tue, Feb 19, 2008 at 08:27:25AM -0600, Jack Steiner wrote:
> > > On Tue, Feb 19, 2008 at 02:58:51PM +0100, Andrea Arcangeli wrote:
> > > > understand the need for invalidate_begin/invalidate_end pairs at all.
> > > 
> > > The need of the pairs is crystal clear to me: range_begin is needed
> > > for GRU _but_only_if_ range_end is called after releasing the
> > > reference that the VM holds on the page. _begin will flush the GRU tlb
> > > and at the same time it will take a mutex that will block further GRU
> > > tlb-miss-interrupts (no idea how they manange those nightmare locking,
> > > I didn't even try to add more locking to KVM and I get away with the
> > > fact KVM takes the pin on the page itself).
> > 
> > As it turns out, no actual mutex is required. _begin_ simply increments a
> > count of active range invalidates, _end_ decrements the count. New TLB
> > dropins are deferred while range callouts are active.
> > 
> > This would appear to be racy but the GRU has special hardware that
> > simplifies locking. When the GRU sees a TLB invalidate, all outstanding
> > misses & potentially inflight TLB dropins are marked by the GRU with a
> > "kill" bit. When the dropin finally occurs, the dropin is ignored & the
> > instruction is simply restarted. The instruction will fault again & the TLB
> > dropin will be repeated.  This is optimized for the case where invalidates
> > are rare - true for users of the GRU.
> 
> OK (thanks to Robin as well). Now I understand why you are using it,
> but I don't understand why you don't defer new TLBs after the point
> where the linux pte changes. If you can do that, then you look and
> act much more like a TLB from the point of view of the Linux vm.

Christoph was forced to put the invalidate_range callback _after_
dropping the PT lock because xpmem has to wait I/O there. But
invalidate_range is called after freeing the VM reference on the pages
so then GRU needed a _range_begin too because GRU has to flush the tlb
before the VM reference on the page is released (xpmem and KVM pin the
pages mapped by the secondary mmu, GRU doesn't). So then
invalidate_range was renamed to invalidate_range_end.



More information about the general mailing list