[ofa-general] Memory registration redux

Jason Gunthorpe jgunthorpe at obsidianresearch.com
Thu May 7 15:48:06 PDT 2009


On Thu, May 07, 2009 at 02:46:55PM -0700, Roland Dreier wrote:

>  > Using register/unregister exposes a race for the original case you
>  > brought up - but that race is completely unfixable without hardware
>  > support. At least it now becomes a hw specific race that can be
>  > printk'd and someday fixed in new HW rather than an unsolvable API
>  > problem..
> 
> We definitely don't want to duplicate all this logic in every hardware
> device driver, so most of it needs to be generic.  If we're adding new
> low-level driver methods to handle this, that definitely raises the cost
> of implementing all this.  But I guess if we start with a generic
> register/unregister fallback that drivers can override for better
> performance, then I think we're in good shape.

Right, I was only thinking of a new driver call along the lines of
update_mr_pages() that just atomically updates the HCA's mapping with
new page table entries. It really would be device specific. If no such
call is available then unregister/register + a printk warning is a
fair generic implementation.

To be clear, what I'm thinking is that this would only be invoked if
the VM is being *replaced*. Simply unmapping VM should do nothing.
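A minimal user-space sketch of that dispatch logic (all names here are
hypothetical; update_mr_pages is the proposed optional per-driver hook,
the other two ops just stand in for the existing verbs paths):

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical driver ops table; update_mr_pages is the proposed
 * optional hook for atomically swapping an MR's page table entries. */
struct mr_ops {
	int (*update_mr_pages)(void *mr, unsigned long start, size_t npages);
	int (*unregister_mr)(void *mr);
	int (*register_mr)(void *mr);
};

/* Generic core: prefer the atomic device-specific update; otherwise
 * fall back to unregister+register and log that a race window exists
 * until hardware support arrives. */
static int mr_replace_pages(struct mr_ops *ops, void *mr,
			    unsigned long start, size_t npages)
{
	if (ops->update_mr_pages)
		return ops->update_mr_pages(mr, start, npages);

	fprintf(stderr, "mr: racy unregister/register fallback used\n");
	int rc = ops->unregister_mr(mr);
	if (rc)
		return rc;
	return ops->register_mr(mr);
}
```

The point of the fallback is that the race becomes observable (the
printk) and device specific, rather than being baked into the API.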

> Which means that the MMU notifier has to walk the list of memory
> registrations and mark any affected ones as dirty (possibly with a hint
> about which pages were invalidated) as you suggest below.  Falling back
> to the "check every registration" ultra-slow-path I think should never
> ever happen.

Yikes, yes, that makes sense. And hearing that at least Open MPI caps
the registration size makes me think tracking at per-page granularity
is probably unnecessary.

>  > Well, exactly, that's the problem. If you can't trap mmap you cannot
>  > do the initial faulting and mapping for a new object that is being
>  > mapped into an existing MR.
>  > 
>  > Consider:
>  > 
>  >   void *a = mmap(0,PAGE_SIZE..);
>  >   ibv_register();
>  >   // [..]
>  >   munmap(a);
>  >   ibv_synchronize();
>  > 
>  >   // At this point we want the HCA mapping to point to oblivion
>  > 
>  >   mmap(a,PAGE_SIZE,MAP_FIXED);
>  >   ibv_synchronize();
>  > 
>  >   // And now we want it to point to the new allocation
>  > 
>  > I use MAP_FIXED to illustrate the point, but Jeff has said the same
>  > address re-use happens randomly in real apps.
> 
> This can be handled I think, although at some cost.  Just have the
> kernel keep track of which MMU sequence number actually invalidated each
> MR, and return (via ibv_synchronize()) the MMU change sequence number
> that userspace is in sync with.  So in the example above, the first
> synchronize after munmap() will fail to fix up the first registration,
> since it is pointing to an unmapped virtual address, and hence it will
> leave that MR on the dirty list, and return that sequence number as not
> being synced up yet.  And then the second synchronize will see that MR
> still on the dirty list, and try again to find the pages.

I agree some kind of kernel/userspace exchange of the sequence number
is necessary to make all the locking and race conditions work out.

But the problem I'm seeing is how the sequence number gets
incremented by the kernel after the mmap() call in the above sequence.
Which mmu_notifier/etc callback do you hook for that?

The *very best* hook would be one that is called when a mm has new
virtual address space allocated; the verbs layer would then intersect
the allocated address range with the registration list. Any
registrations that have pages in the allocated region are marked
invalid.

Imagine every call to ibv_synchronize being prefixed with a check
whether the sequence number has changed.
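A sketch of that fast path, simulated in user space (the shared
counter and ibv_synchronize_sim are stand-ins; in the real design the
kernel would bump the counter from its mmu_notifier hook and
ibv_synchronize would fix up dirty MRs before returning the sequence
number userspace is now in sync with):

```c
#include <stdint.h>
#include <stdbool.h>

/* Counter the kernel would bump whenever an invalidation intersects
 * a registration (simulated here as a plain variable). */
static uint64_t kernel_seq;

/* Sequence number the userspace cache last synchronized to. */
static uint64_t cached_seq;

/* Stand-in for ibv_synchronize(): the real call would drain the
 * kernel's dirty-MR list and return the synced sequence number. */
static uint64_t ibv_synchronize_sim(void)
{
	return kernel_seq;
}

/* Fast path on every registration-cache hit: only enter the kernel
 * when the counter has moved. */
static bool cache_hit(void)
{
	if (cached_seq == kernel_seq)
		return true;	/* nothing changed, cached MR is valid */
	cached_seq = ibv_synchronize_sim();
	return false;		/* caller must revalidate the MR */
}
```

The cache-hit path stays a single memory comparison, which is what
makes checking on every hit affordable.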

>  > This method avoids the problem you noticed, but there is extra work to
>  > fixup a registration that may never be used again. I strongly suspect
>  > that in the majority of cases this extra work should be about on the
>  > same order as userspace calling unregister on the MR.
> 
> Yes, also it doesn't match the current MPI way of lazily unregistering
> things, and only garbage collecting the refcnt 0 cache entries when a
> registration fails.  With this method, if userspace unregisters
> something, it really is gone, and if it doesn't unregister it, then it
> really uses up space until userspace explicitly unregisters it.  Not
> sure how MPI implementers feel about that.

Well, mixing in the lazy unregister is not a significant change: just
don't increment the sequence number on munmap, and have the kernel do
nothing until pages are mapped into an existing registration. With a
flag, both behaviors are possible.
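A toy illustration of how such a flag might split the two behaviors
(all names hypothetical; the invariant is that remapping must dirty
the registration in both modes, while plain munmap only dirties it in
the eager mode):

```c
#include <stdbool.h>

/* Hypothetical per-MR mode: lazy semantics leave an MR alone on
 * munmap and only invalidate when pages are mapped back over it. */
enum mr_mode { MR_EAGER, MR_LAZY };

struct mr_sim {
	enum mr_mode mode;
	bool dirty;
};

static void on_munmap(struct mr_sim *mr)
{
	if (mr->mode == MR_EAGER)
		mr->dirty = true;	/* eager: invalidate immediately */
	/* lazy: do nothing until pages are mapped back in */
}

static void on_new_mapping(struct mr_sim *mr)
{
	mr->dirty = true;	/* both modes must catch remapping,
				 * or RDMA hits the wrong pages */
}
```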

All of this work is mainly to close the hole where mapping new memory
over already registered VM results in RDMA to the wrong pages. Fixing
this hole removes the need to trap memory management syscalls and
solves that data corruption problem.

From there various optimizations can be done, like lazy garbage
collecting registrations that no longer point to mapped memory.

>  > Or, ignore the overlapping problem, and use your original technique,
>  > slightly modified:
>  >  - Userspace registers a counter with the kernel. Kernel pins the
>  >    page, sets up mmu notifiers and increments the counter when
>  >    invalidates intersect with registrations
>  >  - Kernel maintains a linked list of registrations that have been
>  >    invalidated via mmu notifiers using the registration structure
>  >    and a dirty bit
>  >  - Userspace checks the counter at every cache hit, if different it
>  >    calls into the kernel:
>  >        MR_Cookie *mrs[100];
>  >        int rc = ibv_get_invalid_mrs(mrs,100);
>  >        invalidate_cache(mrs,rc);
>  >        // Repeat until drained
>  > 
>  >    get_invalid_mrs traverses the linked list and returns an
>  >    identifying value to userspace, which looks it up in the cache,
>  >    calls unregister and removes it from the cache.
> 
> What's the advantage of this?  I have to do the get_invalid_mrs() call a
> bunch of times, rather than just reading which ones are invalid from the
> cache directly?

This is a trade-off: the above is a more normal kernel API and lets
the app get a list of changes it can scan. Having the kernel update
flags means that if the app wants a list of changes it has to scan
all registrations.

Knowing the registration is no good lets you remove it from the search
list and save time on the hot path.

I imagined a call that would return as much in one go as memory
allows (i.e. the 100 entries above), so I doubt more than one call
per event would ever be needed.
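A user-space simulation of that drain loop, following the quoted
pseudocode (get_invalid_mrs_sim stands in for the proposed
ibv_get_invalid_mrs call; the in-memory dirty list stands in for the
kernel's linked list of invalidated registrations):

```c
#include <stddef.h>

typedef int MR_Cookie;	/* identifying value, as in the quote */

/* Simulated kernel-side dirty list of invalidated registrations. */
static MR_Cookie dirty_list[256];
static size_t dirty_len;

/* Stand-in for ibv_get_invalid_mrs(): copies out up to n cookies,
 * removes them from the dirty list, and returns the count. */
static int get_invalid_mrs_sim(MR_Cookie *mrs, int n)
{
	int rc = 0;
	while (rc < n && dirty_len > 0)
		mrs[rc++] = dirty_list[--dirty_len];
	return rc;
}

/* Userspace drain loop: repeat until the list is empty. With the
 * batch sized to available memory, one pass usually suffices. */
static int drain(void)
{
	MR_Cookie mrs[100];
	int total = 0, rc;

	while ((rc = get_invalid_mrs_sim(mrs, 100)) > 0)
		total += rc;	/* real code would unregister each MR
				 * and evict it from the cache here */
	return total;
}
```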

Jason


