[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation

Sat May 7 06:18:08 PDT 2005

Sorry for not replying earlier (indeed, sorry for not joining in the
wider RDMA pinning discussion), concentrating on other stuff at present.

On Fri, 6 May 2005, Timur Tabi wrote:
> Timur Tabi wrote:
> 
> > I haven't gotten a reply to this question, but I've done my own research,
> > and I think I found the answer.  Using my own test of get_user_pages(),
> > it appears that the fix was placed in 2.6.7.  However, I would like to
> > know specifically what the fix is. Unfortunately, tracking this stuff
> > down is beyond my understanding of the Linux VM.
> 
> I'm also still waiting for a reply to this question. Anyone????
> 
> Upon doing some more research, I think the fix might be those code instead:

I believe you're right this time - I was rather puzzled by your earlier
choice, then unhelpfully forgot to reply and point you a few lines further
down to this comment, which does shout "get_user_pages fix" quite loudly.

> /*
> * Don't pull an anonymous page out from under get_user_pages.
> * GUP carefully breaks COW and raises page count (while holding
> * page_table_lock, as we have here) to make sure that the page
> * cannot be freed.  If we unmap that page here, a user write
> * access to the virtual address will bring back the page, but
> * its raised count will (ironically) be taken to mean it's not
> * an exclusive swap page, do_wp_page will replace it by a copy
> * page, and the user never get to see the data GUP was holding
> * the original page for.
> */
> if (PageSwapCache(page) &&
> page_count(page) != page->mapcount + 2) {
> ret = SWAP_FAIL;
> goto out_unmap;
> }
> 
> Both this change and the other one I mentioned are new to 2.6.7.  I suppose I
> could try applying these patches to the 2.6.6 kernel and see if anything
> improves, but that won't help me understand what's really going on.

There's a lot of change in the rmap area between 2.6.6 and 2.6.7, but
you're right that this is an isolated fix, which could in principle be
applied to earlier releases.  Though I don't see it's worth doing now.

> The above comment makes sounds almost like it's a fix,

Almost?  Sorry if my comment doesn't make it obvious it's a fix for a
get_user_pages issue - I rewrote Andrea Arcangeli's original commment.
The analysis and fix are his.

> but it talks about copy-on-write,
> which is has nothing to do with the real problem.

Oh, well, maybe, but what is the real problem?
Are you sure that copy-on-write doesn't come into it?

I haven't reread through the whole thread, but my recollection is
that you never quite said what the real problem is: you'd found some
time ago that get_user_pages sometimes failed to pin the pages for
your complex app, so were forced to mlock too; but couldn't provide
any simple test case for the failure (which can indeed be a lot of
work to devise), so we were all in the dark as to what went wrong.

But you've now found that 2.6.7 and later kernels allow your app to
work correctly without mlock, good.  get_user_pages is certainly the
right tool to use for such pinning.  (On the question of whether
mlock guarantees that user virtual addresses map to the same physical
addresses, I prefer Arjan's view that it does not; but accept that
there might prove to be difficulties in holding that position.)

So, it works now, you've exonerated today's get_user_pages, and you've
identified at least one get_user_pages fix which went in at that time:
do we really need to chase this further?

Oh, in writing of copy-on-write, I've just remembered another fix
for get_user_pages which I made in 2.6.7 (though I've not heard of
anyone seeing the problem fixed): call to do_wp_page in do_swap_page.
get_user_pages assumes that the write fault it generates will break
copy-on-write i.e. will make a private copy page when necessary,
before returning to the caller; but that wasn't happening in the
do_swap_page case.

By the way, please don't be worried when soon the try_to_unmap_one
comment and code that you identified above disappear.  When I'm
back in patch submission mode, I'll be sending Andrew a patch which
removes it, instead reworking can_share_swap_page to rely on the
page_mapcount instead of page_count, which avoids the ironical
behaviour my comment refers to, and allows an awkward page migration
case to proceed (once unpinned).  Andrea and I now both prefer this
page_mapcount approach.

Hugh