[openib-general] [hugh at veritas.com: Re: Nick's core remove PageReserved broke vmware...]

Gleb Natapov glebn at voltaire.com
Thu Nov 3 06:19:24 PST 2005


Hello Michael,

It seems that it is time to resurrect your DONTCOPY patch. Can you do it?
If you have no time now, I can handle it.

----- Forwarded message from Hugh Dickins <hugh at veritas.com> -----

From: Hugh Dickins <hugh at veritas.com>
To: Gleb Natapov <gleb at minantech.com>
Cc: Benjamin Herrenschmidt <benh at kernel.crashing.org>,
	Petr Vandrovec <vandrove at vc.cvut.cz>,
	Nick Piggin <nickpiggin at yahoo.com.au>,
	"Michael S. Tsirkin" <mst at mellanox.co.il>,
	Badari Pulavarty <pbadari at us.ibm.com>,
	Linux Kernel Mailing List <linux-kernel at vger.kernel.org>
Subject: Re: Nick's core remove PageReserved broke vmware...
Date: Thu, 3 Nov 2005 14:11:46 +0000 (GMT)

On Thu, 3 Nov 2005, Gleb Natapov wrote:
> On Wed, Nov 02, 2005 at 10:02:49PM +0000, Hugh Dickins wrote:
> > On Thu, 3 Nov 2005, Benjamin Herrenschmidt wrote:
> > > On Wed, 2005-11-02 at 21:41 +0000, Hugh Dickins wrote:
> > > 
> > > > The only extant problem here is if the pages are private, and you
> > > > fork while this is going on, and the parent user process writes to the
> > > > area before completion: then COW leaves the child with the page being
> > > > DMAed into, giving the parent a copied page which may be incomplete.
> > > 
> > > Won't happen, and if it does, it's a user error to rely on that working,
> > > so it doesn't matter.
> > 
> > I wish everyone else would see it that way!  (But some people do
> > have valid scenarios where it can't just be ruled out completely.)
> > 
> I am one of those people :)
> 
> The last discussion about this issue ended without resolution, but I remember
> you mentioned the possibility of leaving ptes writable in the parent during fork
> for private pages mapped for DMA. Is this approach acceptable?

I was toying with that idea back then, but it leaves the pages in a
peculiar limbo between being shared and private, such that it's hard
to think through the consequences.  We do already have a case rather
like that (ptrace writing to a write-protected area), but some of us
are a bit worried by that one, so I'd be foolish now to recommend
another such subversion of the rules.

In the time since we discussed before, I've rather come full circle
round to my original position: abandoning such ideas of trying to
handle it from get_user_pages itself, appreciating the simplicity
of the original PROT_DONTCOPY idea from you guys; but sticking to my
initial reaction that this is better done by madvise(MADV_DONTCOPY),
not by the mmap/mprotect route in Michael's patch.  (I never bought
the "racy" argument advanced in favour of the mmap flag.)

One of the factors which has swayed me to the DONTCOPY approach is
Nick's 2.6.14 optimization in fork's copy_page_range, where areas
which can be safely faulted later are not copied pte by pte.  But
that doesn't apply to all areas, and in particular cannot apply to
VM_NONLINEAR shared areas.  It should be of benefit to apps which
use large such areas, and also do a lot of forking children who don't
need those areas, to be able to mark them VM_DONTCOPY.  Or any other
vmas the children won't need.  (But there's one big distinction between
the optimization and VM_DONTCOPY: the optimization copies the vma but
doesn't fill in its ptes, whereas VM_DONTCOPY doesn't even copy the vma.)

Two warnings if someone would like to post a MADV_DONTCOPY patch.
It should include a matching MADV_DOCOPY to clear the condition, but
that must not be allowed to clear VM_DONTCOPY set originally by a driver:
perhaps you'll end up with a VM_UDONTCOPY or something like that.

And Badari has a MADV_REMOVE patch in the works, taking the next
slot (just after MADV_DONTNEED in most of the arches): probably
best for you to base yours on top of his (though yours is simpler
and might jump ahead).

Hugh

----- End forwarded message -----
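
For illustration, here is a minimal userspace sketch of how the madvise
route could look from an application. It rests on assumptions: there is no
MADV_DONTCOPY in any released kernel, so the sketch uses MADV_DONTFORK,
the name under which mainline Linux (2.6.16 and later) exposes this
behaviour. child_inherits_buffer() is a hypothetical helper written for
this example, not part of any patch in this thread. It marks a private
anonymous buffer, forks, and lets the child use mincore() to check whether
the range is still mapped.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread).
 * Returns 1 if a forked child still has the buffer mapped,
 *         0 if the child has no mapping at that address,
 *        -1 on any error.
 * Assumption: MADV_DONTFORK stands in for the proposed MADV_DONTCOPY.
 */
int child_inherits_buffer(int mark_dontfork)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    size_t len = (size_t)page;

    /* Private anonymous page, standing in for a buffer registered for DMA. */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    buf[0] = 1;  /* touch the page so it is actually populated */

    /* Ask the kernel not to give this range to children at fork time. */
    if (mark_dontfork && madvise(buf, len, MADV_DONTFORK) != 0) {
        munmap(buf, len);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        munmap(buf, len);
        return -1;
    }
    if (pid == 0) {
        /* Child: mincore() fails with ENOMEM if the range is unmapped. */
        unsigned char vec;
        if (mincore(buf, len, &vec) == 0)
            _exit(1);                      /* still mapped in the child */
        _exit(errno == ENOMEM ? 0 : 2);    /* 0: not mapped, 2: other error */
    }

    int status;
    pid_t waited = waitpid(pid, &status, 0);
    munmap(buf, len);
    if (waited < 0 || !WIFEXITED(status))
        return -1;

    int code = WEXITSTATUS(status);
    return (code == 0 || code == 1) ? code : -1;
}
```

With the range marked, the parent remains the only owner of the pages, so
a write by the parent after fork() has no COW copy to trigger and the
device keeps DMAing into the page the parent actually sees. Calling
child_inherits_buffer(0) shows the default behaviour for comparison.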

--
			Gleb.


