[ofa-general] Re: [PATCH] mmu notifiers #v8

Nick Piggin npiggin at suse.de
Mon Mar 3 10:45:17 PST 2008


On Mon, Mar 03, 2008 at 12:06:05PM -0600, Jack Steiner wrote:
> On Mon, Mar 03, 2008 at 05:59:10PM +0100, Nick Piggin wrote:
> > > Maintaining a long-term reference on a page is a problem. The GRU does not
> > > currently maintain tables to track the pages for which dropins have been done.
> > > 
> > > The GRU has a large internal TLB and is designed to reference up to 8PB of
> > > memory. The size of the tables to track this many referenced pages would be
> > > a problem (at best).
> > 
> > Is it any worse a problem than the pagetables of the processes which have
> > their virtual memory exported to GRU? AFAIKS, no; it is on the same
> > magnitude of difficulty. So you could do it without introducing any
> > fundamental problem (memory usage might be increased by some constant
> > factor, but I think we can cope with that in order to make the core patch
> > really nice and simple).
> 
> Functionally, the GRU is very close to what I would consider to be the
> "standard TLB" model. Dropins and flushs map closely to processor dropins
> and flushes for cpus.  The internal structure of the GRU TLB is identical to
> the TLB of existing cpus.  Requiring the GRU driver to track dropins with
> long term page references seems to me a deviation from having the basic
> mmuops support a "standard TLB" model. AFAIK, no other processor requires
> this.

That is because the CPU TLBs have the mmu_gather batching APIs, which
avoid the problem. It would be possible to do something similar for the
GRU: take a reference on each page-to-be-invalidated in invalidate_page,
and release those references when you get the invalidate_range. Or do
some other scheme which makes mmu notifiers work similarly to the
mmu_gather API. But don't just go and invent something completely
different in the form of this invalidate_begin, clear linux pte,
invalidate_end API.
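Something along those lines might look like the sketch below. It is only
illustrative: the callback signatures are assumed (in particular that
invalidate_page gets handed the struct page), and gru_mm_struct /
gru_flush_tlb_range() are made-up stand-ins for whatever the GRU driver
actually has. The point is just the mmu_gather-style pin/flush/release
batching.

#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <linux/spinlock.h>

#define GRU_GATHER_MAX	64	/* arbitrary batch size */

struct gru_mm_struct {
	struct mmu_notifier ms_notifier;
	spinlock_t ms_lock;
	unsigned int ms_nr;			/* pages pinned so far */
	struct page *ms_pages[GRU_GATHER_MAX];
};

/* Flush the GRU TLB for [start, end) and drop all deferred pins. */
static void gru_flush_and_release(struct gru_mm_struct *gms,
				  unsigned long start, unsigned long end)
{
	gru_flush_tlb_range(gms, start, end);	/* hypothetical hw flush */
	while (gms->ms_nr)
		put_page(gms->ms_pages[--gms->ms_nr]);
}

static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
				unsigned long address, struct page *page)
{
	struct gru_mm_struct *gms =
		container_of(mn, struct gru_mm_struct, ms_notifier);

	get_page(page);			/* hold the page until the batched flush */
	spin_lock(&gms->ms_lock);
	if (gms->ms_nr == GRU_GATHER_MAX)	/* batch full: flush everything early */
		gru_flush_and_release(gms, 0, TASK_SIZE);
	gms->ms_pages[gms->ms_nr++] = page;
	spin_unlock(&gms->ms_lock);
}

static void gru_invalidate_range(struct mmu_notifier *mn, struct mm_struct *mm,
				 unsigned long start, unsigned long end)
{
	struct gru_mm_struct *gms =
		container_of(mn, struct gru_mm_struct, ms_notifier);

	spin_lock(&gms->ms_lock);
	gru_flush_and_release(gms, start, end);
	spin_unlock(&gms->ms_lock);
}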


> Tracking TLB dropins (and long term page references) could be done but it
> adds significant complexity and scaling issues. The size of the tables to
> track many TB (to PB) of memory can get large. If the memory is being
> referenced by highly threaded applications, then the problem becomes even
> more complex. Either tables must be replicated per-thread (and require even
> more memory), or the table structure becomes even more complex to deal with
> node locality, cacheline bouncing, etc.

I don't think it would be that significant in terms of complexity or
scaling.

For a quick solution, you could stick a radix tree in each of your
registered mmu notifiers (i.e. one per mm), indexed by
virtual address >> PAGE_SHIFT and returning the struct page *. The size
is no different from that of the page tables, and the locking is pretty
scalable.
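To make that concrete, something like the following (again a rough
sketch; gru_mm_struct and the helper names are made up, only the radix
tree usage is the point):

#include <linux/mm.h>
#include <linux/radix-tree.h>
#include <linux/spinlock.h>

/* Hypothetical per-mm driver state: maps (address >> PAGE_SHIFT) to the
 * struct page * that the GRU did a dropin for. */
struct gru_mm_struct {
	struct radix_tree_root ms_pages;
	spinlock_t ms_lock;
};

static void gru_mm_init(struct gru_mm_struct *gms)
{
	/* GFP_ATOMIC because inserts happen under the spinlock */
	INIT_RADIX_TREE(&gms->ms_pages, GFP_ATOMIC);
	spin_lock_init(&gms->ms_lock);
}

/* Remember (and pin) a page the GRU has done a TLB dropin for. */
static int gru_track_dropin(struct gru_mm_struct *gms,
			    unsigned long address, struct page *page)
{
	int ret;

	get_page(page);		/* long-term reference, dropped at invalidate */
	spin_lock(&gms->ms_lock);
	ret = radix_tree_insert(&gms->ms_pages, address >> PAGE_SHIFT, page);
	spin_unlock(&gms->ms_lock);
	if (ret)
		put_page(page);	/* -EEXIST (already tracked) or -ENOMEM */
	return ret;
}

/* Forget and unpin everything in [start, end) after the GRU TLB flush.
 * A gang lookup would be better for very large ranges; this keeps it simple. */
static void gru_release_range(struct gru_mm_struct *gms,
			      unsigned long start, unsigned long end)
{
	unsigned long index;
	struct page *page;

	spin_lock(&gms->ms_lock);
	for (index = start >> PAGE_SHIFT; index < (end >> PAGE_SHIFT); index++) {
		page = radix_tree_delete(&gms->ms_pages, index);
		if (page)
			put_page(page);
	}
	spin_unlock(&gms->ms_lock);
}

The lookup on the dropin side would just be radix_tree_lookup() under the
same lock.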

After that, I would really like to see whether the numbers justify
larger changes.



