[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation

Timur Tabi timur.tabi at ammasso.com
Mon Apr 18 09:22:29 PDT 2005


Andrew Morton wrote:
> Roland Dreier <roland at topspin.com> wrote:
> 
>>    Troy> Do we even need the mlock in userspace then?
>>
>>Yes, because the kernel may go through and unmap pages from userspace
>>while trying to swap.  Since we have the page locked in the kernel,
>>the physical page won't go anywhere, but userspace might end up with a
>>different page mapped at the same virtual address.
> 
> 
> That shouldn't happen.  If get_user_pages() has elevated the refcount on a
> page then the following can happen:
> 
> - The VM may decide to add the page to swapcache (if it's not mmapped
>   from a file).
> 
> - Once the page is backed by either swapcache or a (mmapped) file, the VM
>   may decide to unmap the application's PTEs.  A later minor fault by the
>   app will cause the same physical page to be remapped.
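
For concreteness, the pinning pattern being discussed looks roughly like this.  This is a 
sketch only, not code from either driver; pin_user_buffer() is a made-up helper name, and 
the call uses the 2.6-era get_user_pages() signature:

#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/pagemap.h>

/*
 * Sketch: pin npages of a user buffer starting at uaddr so the
 * physical pages cannot be reclaimed while DMA is in flight.
 */
static int pin_user_buffer(unsigned long uaddr, int npages,
                           struct page **pages)
{
        int got;

        down_read(&current->mm->mmap_sem);
        got = get_user_pages(current, current->mm,
                             uaddr & PAGE_MASK, npages,
                             1 /* write */, 0 /* force */,
                             pages, NULL);
        up_read(&current->mm->mmap_sem);

        /*
         * The elevated refcounts keep the physical pages in place, but
         * (per the point above) the VM may still unmap the app's PTEs;
         * a later minor fault should remap the same physical pages.
         */
        return got;
}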

That's not what we're seeing.  We have hardware that does DMA over the network (much like 
the Infiniband stuff), and we have a testcase that fails if get_user_pages() is used, but 
not if mlock() is used.  Consider two computers on a network, X and Y.  Both have our 
hardware, which can transfer a page of memory from a given physical address on X to a 
physical address on Y.

1) Application on X allocates a block of memory, and passes the virtual address to the driver.
2) Driver on X calls get_user_pages() and then obtains a physical address for the memory.
3) Application and driver on Y do the same thing.
4) App X fills memory with some data D.
5) App X then allocates as much memory as it possibly can, touches every page in this 
memory, and then frees it.  This forces other pages to be swapped out, including the 
supposedly pinned memory.  (A sketch of this step follows the conclusion below.)
6) App X then tells Driver X to transfer data D to computer Y.
7) App Y compares data D and finds that it doesn't match what it's supposed to be.

Conclusion: during step 5, the data in pinned memory is swapped out or something.  I'm not 
sure where it goes.
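
For what it's worth, step 5 can be approximated from userspace with something like the 
following sketch (the 64 MiB chunk size and the 4096-chunk cap are arbitrary, and with 
default overcommit settings the OOM killer may intervene before malloc() ever fails):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Sketch of step 5: allocate and dirty as much memory as possible,
 * then free it, to pressure the VM into evicting other pages.
 */
int main(void)
{
        enum { MAX_CHUNKS = 4096 };
        static char *blocks[MAX_CHUNKS];
        const size_t chunk = 64UL << 20;        /* 64 MiB */
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        size_t n = 0, total, off;

        while (n < MAX_CHUNKS) {
                char *p = malloc(chunk);
                if (p == NULL)
                        break;
                /* Touch every page so real frames get committed. */
                for (off = 0; off < chunk; off += page)
                        p[off] = 1;
                blocks[n++] = p;
        }

        total = n;
        while (n > 0)
                free(blocks[--n]);
        printf("allocated, touched and freed %lu chunks\n",
               (unsigned long)total);
        return 0;
}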

We can only demonstrate this problem using our hardware, because you need the ability to 
transfer memory without using the CPU.  We were going to prepare a test case and ship sample 
hardware to a few kernel developers to prove our point, but now that we're able to call 
mlock() in non-user processes, we decided it wasn't worth our time.  Actually, I discovered 
that I can call cap_raise() and set the rlimit structure, which gives me the ability to 
call mlock() on any amount of memory from any process on both the 2.4 and 2.6 kernels, both 
of which we need to support.  If I had thought of that earlier, I wouldn't have needed all 
those hacks to call sys_mlock() from the driver.
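
For the record, that trick looks roughly like the following on a ~2.6.11 kernel; on 2.4 the 
rlim array lives directly in task_struct rather than signal_struct, and 
allow_unlimited_mlock() is a made-up name.  This has to run in the context of the process 
being granted the privilege:

#include <linux/sched.h>
#include <linux/capability.h>
#include <linux/resource.h>

/*
 * Sketch: grant the calling process CAP_IPC_LOCK and lift its
 * RLIMIT_MEMLOCK limit so a later mlock() (from userspace, or via
 * sys_mlock() from the driver) can pin any amount of memory.
 */
static void allow_unlimited_mlock(void)
{
        cap_raise(current->cap_effective, CAP_IPC_LOCK);

        current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur = RLIM_INFINITY;
        current->signal->rlim[RLIMIT_MEMLOCK].rlim_max = RLIM_INFINITY;
}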




-- 
Timur Tabi
Staff Software Engineer
timur.tabi at ammasso.com


