[openib-general] Question about pinning memory

Gleb Natapov glebn at voltaire.com
Wed Jul 27 00:02:43 PDT 2005


On Tue, Jul 26, 2005 at 09:43:44PM -0700, Roland Dreier wrote:
>     Gleb> This is what Pete did. We are back to one syscall on each
>     Gleb> registration.  And the worst case is three syscalls: first
>     Gleb> to check that the mapping is valid, second to deregister
>     Gleb> memory if it is not, and third to register the new memory.
> 
> I don't think this is an accurate description of Pete's work.  As I
> understand it, the fast path (no VM operations have occurred since
> registration was done) does not involve any system calls -- the MPI
> library just checks the queue of VM events (which will likely be in
> cache already), sees that it is empty, and proceeds with the IO.
> 
Then we understand Pete differently. This is what he wrote in his first
email on the subject:
   The MPI library essentially makes a system call before
   reusing a cached memory registration to verify it is still valid.
How does "the MPI library just check the queue of VM events" without a
system call? Do you mean the kernel dumps events into some preallocated
memory in the user process's address space?
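That would be one way to make it work: the kernel bumps an event counter (or appends to a queue) in a page mapped read-only into the process, and the library's fast path only reads it. A minimal sketch, assuming such a shared page exists -- vm_events, cached_generation and the device it would be mapped from are all hypothetical:

```c
/* Hypothetical sketch: the kernel increments a generation counter in a
 * shared, read-only page every time the process's vma list changes.
 * Checking it costs one (likely cached) memory read -- no system call. */
#include <stdint.h>

struct vm_event_page {
    volatile uint32_t generation;   /* bumped by the kernel on any vma change */
};

struct vm_event_page *vm_events;    /* would be mmap()ed from a char device */
uint32_t cached_generation;         /* snapshot taken when the cache was built */

/* Fast path: the cached registrations are valid iff nothing has changed. */
int registration_cache_valid(void)
{
    return vm_events->generation == cached_generation;
}
```

Only on a mismatch would the library fall into the slow path and resynchronise, which matches Roland's "no VM operations have occurred" description of the fast path.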

> I was just trying to offer a small refinement so the complex code to
> handle the very rare case of queue overflow is not needed.
> 
> Of course if memory needs to be reregistered, then we hit the slow
> path (just like a cache miss for the registration cache).
> 
>     Gleb> Not exactly. When user frees memory by calling free() the
>     Gleb> memory is not unmapped from process address space by
>     Gleb> libc. We want to cache the registrations in this memory. We
>     Gleb> don't want to cache registration across mmap/munmap/mmap (it
>     Gleb> is much harder to do if at all possible).  When libc unmaps
>     Gleb> memory by sbrk() or munmap the memory guaranteed to be not
>     Gleb> used by correct program so it is safe to deregister it if we
>     Gleb> catch this event.
> 
> The first statement in this paragraph is false.  It's easy to strace a
> simple program that does something like
> 
> 	x = malloc(1000000);
> 	free(x);
> 
> and watch glibc do an mmap(... MAP_ANONYMOUS ...) followed by
> munmap().  Even for smaller allocations, glibc may use sbrk() to
> shrink the heap in free().  You can read about M_TRIM_THRESHOLD and so
> on in the mallopt() documentation.
> 
You are nitpicking. I know about the mallopt() options, thank you. That
does not invalidate the paragraph, though. We remove the cached entry
after libc does the munmap(). In fact that is what the paragraph states,
if you read it to the end.
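For what it is worth, the mmap()/munmap() pair in the strace example is easy to reproduce; a small glibc-specific demo, using mallopt() with M_MMAP_THRESHOLD so the mmap path is taken regardless of the default 128 KB threshold:

```c
/* Large allocations are served by mmap(MAP_ANONYMOUS) and handed back
 * to the kernel by munmap() inside free().  Run under `strace` to see
 * both calls.  M_MMAP_THRESHOLD is tunable; see mallopt(3). */
#include <malloc.h>
#include <stdlib.h>

int run_demo(void)
{
    mallopt(M_MMAP_THRESHOLD, 4096);  /* glibc-specific tuning knob */

    char *x = malloc(1000000);        /* -> mmap(NULL, ..., MAP_ANONYMOUS, ...) */
    if (x == NULL)
        return 1;
    x[0] = 1;                         /* touch it so the mapping is real */
    free(x);                          /* -> munmap() of the same region */
    return 0;
}
```

It is exactly at that munmap() that the cached registration has to be dropped.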


> This means that it is entirely possible for a correct program to have
> different physical memory at the same virtual address, even if it does
> nothing but malloc() and free().  Pete's work allows caching of all
> registrations, including handling mmap() and everything else.
> 
Of course it is possible. It seems you misunderstand me. What is not
possible is for this to happen between malloc() and free() for pinned
memory. After the user calls free() on a chunk of pinned memory we do
not unpin the memory until libc returns it to the kernel via the
munmap()/sbrk() system calls. Again, this is what I already wrote.

The solution that I want to see is this:
1) The program knows somehow (without a system call) that the vma list
has changed.
2) The program fetches the new vma list (possibly by using a system call).
3) The program unregisters cached entries that are no longer in its
virtual address space.
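A rough sketch of step 3, assuming steps 1 and 2 have produced an up-to-date vma list -- the vma and reg_entry structures here are hypothetical, and a real implementation would call ibv_dereg_mr() when evicting:

```c
/* Step 3 of the flow above: drop every cached registration that is no
 * longer fully covered by some vma in the freshly fetched list. */
#include <stddef.h>
#include <stdint.h>

struct vma { uintptr_t start, end; };          /* [start, end) */

struct reg_entry {
    uintptr_t addr;
    size_t len;
    int registered;                            /* still pinned in the HCA? */
};

void evict_stale(struct reg_entry *cache, size_t ncache,
                 const struct vma *vmas, size_t nvmas)
{
    for (size_t i = 0; i < ncache; i++) {
        int covered = 0;
        for (size_t j = 0; j < nvmas; j++) {
            if (cache[i].addr >= vmas[j].start &&
                cache[i].addr + cache[i].len <= vmas[j].end) {
                covered = 1;
                break;
            }
        }
        if (!covered)
            cache[i].registered = 0;           /* would ibv_dereg_mr() here */
    }
}
```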

Step 1 may be achieved by catching munmap()/sbrk(), or by the kernel
sending a signal when the vma list changes, or by something else. But
NOT by a read() on some fd, since that requires a system call on each
cache hit, or a separate thread doing a blocking read(fd), and then you
have synchronisation problems.

If we catch munmap()/sbrk() ourselves, step 2 is not required. This is
even better.

--
			Gleb.
