[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation

Sat May 7 09:30:52 PDT 2005

On Sat, 7 May 2005, Timur Tabi wrote:
> 
> > Oh, well, maybe, but what is the real problem?
> > Are you sure that copy-on-write doesn't come into it?
> 
> No, but I do know that my test case doesn't call fork(), so it's reproducible
> without involving COW.  Of course, I'm sure someone's going to tell me now
> that COW comes into effect even without fork().  If so, please explain.

I'll try.  COW comes into effect whenever you're sharing a page and
then need to make private changes to it.  Fork is one way of sharing
(with ancestor and descendant processes).  Using the empty zero page
is another way of sharing (with all other processes and parts of your
own address space with a readonly page full of zeroes).  Using a file
page from the page cache is another way of sharing.

None of those is actually your case, but our test for whether a page
is shared has been inadequate: oversimplifying, if page_count is more
than 1 then we have to assume it is shared and do the copy-on-write
(if the modifications are to be private).  But there are various places
where the page_count is temporarily raised (e.g. while paging out),
which we cannot distinguish, so occasionally we'll copy on write even
when it's not necessary, but we lack the information to tell us so.

In particular, of course, get_user_pages raises page_count to pin
the page: so making a page appear shared when it's not shared at all.

> The short answer: under "extreme" memory pressure, the data inside a page
> pinned by get_user_pages() is swapped out, moved, or deleted (I'm not sure
> which).  Some other data is placed into that physical location.
> 
> By extreme memory pressure, I mean having the process allocate and touch as
> much memory as possible.  Something like this:
> 
> num_bytes = get_amount_of_physical_ram();
> char *p = malloc(num_bytes);
> for (i=0; i<num_bytes; i+=PAGE_SIZE)
> p[i] = 0;
> 
> The above over-simplified code fails on earlier 2.6 kernels (or earlier
> versions of glibc that accompany most distros the use the earlier 2.6
> kernels).  Either malloc() returns NULL, or the p[i]=0 part causes a segfault.
> I haven't bothered to trace down why.  But when it does work, the page pinned
> by get_user_pages() changes.

Which has to be a bug with get_user_pages, which has no other purpose
than to pin the pages.  I cannot criticize you for working around it
to get your app working on lots of releases, but what _we_ have to do
is fix get_user_pages - and it appears that Andrea did so a year ago.

I'm surprised if it's as simple as you describe (you do say over-
simplified, maybe the critical points have fallen out), since GUP
users would have complained long ago if it wasn't doing the job in
normal cases of memory pressure.  Andrea's case does involve the
process independently trying to touch a page it has pinned for I/O
with get_user_pages.  Or (and I've only just thought of this, suspect
it might be exactly your case) not touch, but apply get_user_pages
again to a page already so pinned (while memory pressure has caused
try_to_unmap_one temporarily to detach it from the user address space
- the aspect of the problem that Andrea's fix attacks).

> My understanding is that mlock() could in theory allow the page to be moved,
> but that currently nothing in the kernel would actually move it.  However,
> that could change in the future to allow hot-swapping of RAM.

That's my understanding too, that nothing currently does so.  Aside from
hot-swapping RAM, there's also a need to be able to migrate pages around
RAM, either to unfragment memory allowing higher-order allocations to
succeed more often, or to get around extreme dmamem/normal-mem/highmem
imbalances without dedicating huge reserves.  Those would more often
succeed if uninhibited by mlock.

> So I need to take into account distro vendors that use an earlier kernel, like
> 2.6.5, and back-port the patch from 2.6.7.  The distro vendor will keep the
> 2.6.5 version number, which is why I can't rely on it.
> 
> I need to know exactly what the fix is, so that when I scan mm/rmap.c, I know
> what to look for.  Currently, I look for this regex:
> 
> try_to_unmap_one.*vm_area_struct
> 
> which seems to work.  However, now I think it's just a coincidence.

Perhaps any release based on 2.6.7 or above, or any release which
mentions "get_user_pages" in its mm/rmap.c or mm/objrmap.c?

> > By the way, please don't be worried when soon the try_to_unmap_one
> > comment and code that you identified above disappear.  When I'm
> > back in patch submission mode, I'll be sending Andrew a patch which
> > removes it, instead reworking can_share_swap_page to rely on the
> > page_mapcount instead of page_count, which avoids the ironical
> > behaviour my comment refers to, and allows an awkward page migration
> > case to proceed (once unpinned).  Andrea and I now both prefer this
> > page_mapcount approach.
> 
> Ugh, that means my regex is probably going to break.  Not only that, but I
> don't understand what you're saying either.  Trying to understand the VM is
> really hard.

Sorry about that, but suiting your regex is low in our priorities for
VM design!  I was tempted to offer to keep a comment on get_user_pages
in mm/rmap.c after the change, but that's really rather babyish: just
assume 2.6.7 and upwards are fixed (or complain if you find not).

Perhaps I'll manage a clearer explanation when I come to write the
change description for the patch, we'll have to see.

> I guess in this specific case, it doesn't really matter, because calling
> mlock() when I should be calling get_user_pages() is not a bad thing.

If you can afford to keep that amount of memory mlocked, and have to
capability to do so, yes, it should do no harm.  We were just upset
to think that mlock was still needed to get around a get_user_pages
bug which was fixed a year ago.

Hugh