[openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation

Roland Dreier roland at topspin.com
Mon Apr 25 06:15:10 PDT 2005


    Timur> With mlock(), we don't need to use get_user_pages() at all.
    Timur> Arjan tells me the only time an mlocked page can move is
    Timur> with hot (un)plug of memory, but that isn't supported on
    Timur> the systems that we support.  We actually prefer mlock()
    Timur> over get_user_pages(), because if the process dies, the
    Timur> locks automatically go away too.

There actually is another way pages can move, with both
get_user_pages() and mlock(): copy-on-write after a fork().  If
userspace does a fork(), then the PTEs for private writable mappings
are marked read-only, and if the original process writes to the page
after the fork(), a new page will be allocated and mapped at the
original virtual address.  The hardware keeps using the old physical
page, so the process and the device no longer see the same memory.
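
To see the copy-on-write split from userspace, something like the
following should do.  It is only a rough sketch: the function name
child_keeps_old_value() is made up for illustration, and it cannot
show physical addresses, only that parent and child stop sharing the
page once the parent writes to it after the fork().

```c
#define _DEFAULT_SOURCE
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if the child still reads the pre-fork value after the parent
 * has written to the page, 0 if the child sees the parent's write, and
 * -1 on error.
 */
int child_keeps_old_value(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	int pipefd[2], status = 0, ok;
	char token = 0;
	pid_t pid;
	int *page;

	page = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (page == MAP_FAILED)
		return -1;
	*page = 1;			/* fault the page in before fork() */

	if (pipe(pipefd) < 0) {
		munmap(page, len);
		return -1;
	}

	pid = fork();
	if (pid < 0) {
		close(pipefd[0]);
		close(pipefd[1]);
		munmap(page, len);
		return -1;
	}
	if (pid == 0) {
		close(pipefd[1]);
		/* Block until the parent has done its post-fork write. */
		if (read(pipefd[0], &token, 1) != 1)
			_exit(2);
		_exit(*page == 1 ? 0 : 1);
	}

	close(pipefd[0]);
	*page = 2;			/* write fault breaks the sharing */
	ok = (write(pipefd[1], &token, 1) == 1);
	close(pipefd[1]);
	if (waitpid(pid, &status, 0) < 0)
		ok = 0;
	munmap(page, len);

	if (!ok || !WIFEXITED(status) || WEXITSTATUS(status) > 1)
		return -1;
	return WEXITSTATUS(status) == 0;
}
```

On a system with normal fork() semantics this should return 1: the
child keeps reading the old value while the parent sees its own write.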

This is actually a pretty big pain, because the only good solution
seems to be for the kernel to mark these registered regions as
VM_DONTCOPY.  Right now this means that driver code ends up monkeying
with vm_flags for user vmas.

Does it seem reasonable to add a new system call to let userspace mark
memory it doesn't want copied into forked processes?  Something like

	long sys_mark_nocopy(unsigned long addr, size_t len, int mark)

which would set VM_DONTCOPY if mark != 0, and clear it if mark == 0.
A better name would be gratefully accepted...

Then to register memory for RDMA, userspace would call
sys_mark_nocopy() (with appropriate accounting to handle possibly
overlapping regions) and the kernel would call get_user_pages().  The
get_user_pages() is of course required because the kernel can't trust
userspace to keep the pages locked.  mlock() would no longer be
necessary.  We can trust userspace to call sys_mark_nocopy() as
needed, because a process can only hurt itself and its children by
misusing the sys_mark_nocopy() call.
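
The accounting for overlapping regions could look roughly like the
sketch below: keep a per-page count of registrations, set the mark
when a page's count goes from 0 to 1, and clear it when the count
drops back to 0.  Everything here is an assumption for illustration:
the demo_* names, the 4 KiB page size, the 64-page toy address range,
and mark_nocopy_stub(), which stands in for the proposed system call
and only records the flag.

```c
#include <stddef.h>

#define DEMO_PAGE_SHIFT	12			/* assumes 4 KiB pages */
#define DEMO_PAGE_SIZE	(1UL << DEMO_PAGE_SHIFT)
#define DEMO_NPAGES	64			/* toy address range */
#define DEMO_LIMIT	(DEMO_NPAGES * DEMO_PAGE_SIZE)

static unsigned int reg_count[DEMO_NPAGES];	/* registrations per page */
static int nocopy_flag[DEMO_NPAGES];		/* state set by the stub */

/* Stand-in for the proposed sys_mark_nocopy(); it only records the flag. */
static long mark_nocopy_stub(unsigned long addr, size_t len, int mark)
{
	unsigned long p, last = (addr + len - 1) >> DEMO_PAGE_SHIFT;

	for (p = addr >> DEMO_PAGE_SHIFT; p <= last; p++)
		nocopy_flag[p] = (mark != 0);
	return 0;
}

/* Page range covered by [addr, addr + len); -1 if empty or out of range. */
static int page_range(unsigned long addr, size_t len,
		      unsigned long *first, unsigned long *last)
{
	if (len == 0 || addr >= DEMO_LIMIT || len > DEMO_LIMIT - addr)
		return -1;
	*first = addr >> DEMO_PAGE_SHIFT;
	*last = (addr + len - 1) >> DEMO_PAGE_SHIFT;
	return 0;
}

long demo_register(unsigned long addr, size_t len)
{
	unsigned long p, first, last;

	if (page_range(addr, len, &first, &last) < 0)
		return -1;
	for (p = first; p <= last; p++)
		if (reg_count[p]++ == 0)	/* first registration */
			mark_nocopy_stub(p << DEMO_PAGE_SHIFT,
					 DEMO_PAGE_SIZE, 1);
	return 0;
}

long demo_unregister(unsigned long addr, size_t len)
{
	unsigned long p, first, last;

	if (page_range(addr, len, &first, &last) < 0)
		return -1;
	for (p = first; p <= last; p++)
		if (reg_count[p] == 0)
			return -1;		/* page not registered */
	for (p = first; p <= last; p++)
		if (--reg_count[p] == 0)	/* last registration gone */
			mark_nocopy_stub(p << DEMO_PAGE_SHIFT,
					 DEMO_PAGE_SIZE, 0);
	return 0;
}

/* 1 if the page containing addr is marked, 0 if not, -1 if out of range. */
int demo_page_is_nocopy(unsigned long addr)
{
	if (addr >= DEMO_LIMIT)
		return -1;
	return nocopy_flag[addr >> DEMO_PAGE_SHIFT];
}
```

With two registrations that share a page, unregistering the first one
clears the mark only on the pages that the second one does not cover.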

If this seems reasonable then I can code a patch.

 - R.


