[ofa-general] Memory registration redux

Jason Gunthorpe jgunthorpe at obsidianresearch.com
Wed May 6 17:02:31 PDT 2009


On Wed, May 06, 2009 at 03:39:54PM -0700, Roland Dreier wrote:
>  > Well, this conceptually doesn't seem hard. Go through all the pages in
>  > the MR, if any have changed then pin the new page and replace the
>  > pages physical address in the HCA's page table. Once done, synchronize
>  > with the hardware, then run through again and un-pin and release all
>  > the replaced pages.
>  > 
>  > Every HCA must have the necessary primitives for this to support
>  > register and unregister...
> 
> No... every HCA just needs to support register and unregister.  It
> doesn't have to support changing the mapping without full unregister and
> reregister.

Well, I would imagine this entire process to be an HCA-specific
operation, so HW that supports a better method can use it; otherwise
it has to register/unregister. Is this a concern today with existing
HCAs?

Using register/unregister exposes a race for the original case you
brought up - but that race is completely unfixable without hardware
support. At least it now becomes a hw specific race that can be
printk'd and someday fixed in new HW rather than an unsolvable API
problem.
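
To make the fallback concrete, here is a minimal sketch, not real driver
code. It assumes a hypothetical per-HCA ops table in which the in-place
remap hook is optional; struct mr, struct hca_ops, mr_synchronize() and
the toy_* drivers are all invented for illustration. Hardware that has
the hook uses it, everything else falls back to unregister/register and
logs that the race window exists.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical MR state, just enough to observe what happened. */
struct mr {
    int mapped;      /* 1 while the HCA has a valid translation */
    int generation;  /* bumped each time the translation is rebuilt */
};

/* Hypothetical per-HCA operations; remap is optional. */
struct hca_ops {
    int (*reg)(struct mr *mr);
    int (*unreg)(struct mr *mr);
    int (*remap)(struct mr *mr);  /* NULL if the HW cannot remap in place */
};

enum sync_path { SYNC_REMAP, SYNC_REREGISTER };

/* Bring the HCA mapping back in line with the process page tables. */
int mr_synchronize(const struct hca_ops *ops, struct mr *mr,
                   enum sync_path *path)
{
    assert(ops->reg && ops->unreg);  /* every HCA must have these two */

    if (ops->remap) {
        *path = SYNC_REMAP;
        return ops->remap(mr);
    }

    /* Fallback: the MR is invalid between unreg and reg, which is the
     * hardware-specific race; stands in for the printk. */
    *path = SYNC_REREGISTER;
    fprintf(stderr, "mr_synchronize: no in-place remap, MR briefly invalid\n");
    int rc = ops->unreg(mr);
    if (rc)
        return rc;
    return ops->reg(mr);
}

/* Toy drivers so the sketch can be exercised. */
int toy_reg(struct mr *m)   { m->mapped = 1; m->generation++; return 0; }
int toy_unreg(struct mr *m) { m->mapped = 0; return 0; }
int toy_remap(struct mr *m) { m->generation++; return 0; }

const struct hca_ops basic_hca = { toy_reg, toy_unreg, NULL };
const struct hca_ops smart_hca = { toy_reg, toy_unreg, toy_remap };
```

Calling mr_synchronize() with basic_hca takes the re-register path, with
smart_hca the remap path; the caller sees the same result either way.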

> Also this requires potentially walking the page tables of the entire
> process, checking to see if any mappings have changed.  We really want
> to keep the information that the MMU notifiers give us, namely which
> virtual address range is changing.

Walking the page tables of every registration in the process, not the
entire process.

>  > The mmu notifiers can optionally make note of the affected pages
>  > during the callback to reduce the workload of the syscall.
 
> This requires an unbounded amount of events to be queued up in the
> kernel, naively.  (If we lose some events then we have to go back to the
> full page table scan, which I don't think is feasible)

I was thinking more along the lines of having the mmu notifiers put
affected registrations on a per-process (or PD?) dirty linked list,
with the link pointers as part of the registration structure. Set a
dirty flag in the registration too. An extra pointer per registration
and a minor incremental cost to the existing work the mmu notifier
would have to do.
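
Roughly, as a userspace-testable sketch rather than kernel code (struct
reg, struct reg_ctx, note_invalidate() and pop_dirty() are hypothetical
names, and a real version would need locking and would not scan a flat
array):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical registration record. */
struct reg {
    uintptr_t start;         /* first byte of the registered range */
    size_t length;           /* length of the range in bytes */
    bool dirty;              /* set once an invalidate has touched it */
    struct reg *next_dirty;  /* link pointer embedded in the registration */
};

/* Hypothetical per-process (or per-PD) state. */
struct reg_ctx {
    struct reg *dirty_head;
};

/* Does registration r intersect the half-open range [start, end)? */
bool overlaps(const struct reg *r, uintptr_t start, uintptr_t end)
{
    return r->start < end && start < r->start + r->length;
}

/* What the mmu notifier callback would do for an invalidated range. */
void note_invalidate(struct reg_ctx *ctx, struct reg *regs, size_t n,
                     uintptr_t start, uintptr_t end)
{
    assert(start <= end);
    for (size_t i = 0; i < n; i++) {
        struct reg *r = &regs[i];
        /* Already queued or untouched: nothing to do, so the list can
         * never grow past one entry per registration. */
        if (r->dirty || !overlaps(r, start, end))
            continue;
        r->dirty = true;
        r->next_dirty = ctx->dirty_head;
        ctx->dirty_head = r;
    }
}

/* Pop one dirty registration for fix-up; NULL when the list is empty. */
struct reg *pop_dirty(struct reg_ctx *ctx)
{
    struct reg *r = ctx->dirty_head;
    if (r) {
        ctx->dirty_head = r->next_dirty;
        r->next_dirty = NULL;
        r->dirty = false;
    }
    return r;
}
```

The dirty flag is what keeps the queue bounded: repeated invalidates of
the same registration cost one flag test and queue nothing new.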

>  > Only part I don't immediately see is how to trap creation of new VM
>  > (ie mmap), mmu notifiers seem focused on invalidating, ie munmap()..
> 
> Why do we care?  The initial faulting in of mappings occurs when an MR
> is created.

Well, exactly, that's the problem. If you can't trap mmap you cannot
do the initial faulting and mapping for a new object that is being
mapped into an existing MR.

Consider:

  void *a = mmap(0,PAGE_SIZE..);
  ibv_register();
  // [..]
  munmap(a, PAGE_SIZE);
  ibv_synchronize();

  // At this point we want the HCA mapping to point to oblivion

  mmap(a,PAGE_SIZE,MAP_FIXED);
  ibv_synchronize();

  // And now we want it to point to the new allocation

I use MAP_FIXED to illustrate the point, but Jeff has said the same
address re-use happens randomly in real apps.

This is the main deviation from your original idea: instead of having
a granular notification to userspace to unregister a region, the
kernel just goes and fixes it up so the existing registration still
works.

This method avoids the problem you noticed, but there is extra work to
fix up a registration that may never be used again. I strongly suspect
that in the majority of cases this extra work should be on about the
same order as userspace calling unregister on the MR.

Or, ignore the overlapping problem, and use your original technique,
slightly modified:
 - Userspace registers a counter with the kernel. Kernel pins the
   page, sets up mmu notifiers and increments the counter when
   invalidates intersect with registrations
 - Kernel maintains a linked list of registrations that have been
   invalidated via mmu notifiers using the registration structure
   and a dirty bit
 - Userspace checks the counter at every cache hit; if it has changed,
   it calls into the kernel:
       MR_Cookie *mrs[100];
       int rc = ibv_get_invalid_mrs(mrs,100);
       invalidate_cache(mrs,rc);
       // Repeat until drained

   get_invalid_mrs traverses the linked list and returns an
   identifying value to userspace, which looks it up in the cache,
   calls unregister and removes it from the cache.
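
The userspace half of that could look something like the sketch below.
Nothing here is an existing verbs call: the kernel side is replaced by a
fake_kernel stub so the logic can be run standalone, and mr_cookie,
struct mr_cache and sync_cache() are made-up names. The counter is read
before draining, so an invalidate that arrives mid-drain just causes
another pass on the next cache hit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 8
#define BATCH 4

typedef uint32_t mr_cookie;  /* hypothetical identifier for an MR */

/* Userspace registration cache. */
struct mr_cache {
    mr_cookie slot[CACHE_SLOTS];
    int valid[CACHE_SLOTS];
    uint64_t seen;  /* counter value at the last drain */
};

/* Stand-in for the kernel: shared counter plus pending invalid MRs. */
struct fake_kernel {
    uint64_t counter;
    mr_cookie pending[CACHE_SLOTS];
    size_t npending;
};

/* Simulates an mmu notifier invalidate hitting registration c. */
void fake_invalidate(struct fake_kernel *k, mr_cookie c)
{
    assert(k->npending < CACHE_SLOTS);
    k->pending[k->npending++] = c;
    k->counter++;
}

/* Stub for the proposed call: returns up to max invalid MRs. */
size_t fake_get_invalid_mrs(struct fake_kernel *k, mr_cookie *out, size_t max)
{
    size_t n = 0;
    while (n < max && k->npending > 0)
        out[n++] = k->pending[--k->npending];
    return n;
}

/* Drop the listed MRs; real code would also unregister each one. */
void drop_from_cache(struct mr_cache *c, const mr_cookie *mrs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t s = 0; s < CACHE_SLOTS; s++)
            if (c->valid[s] && c->slot[s] == mrs[i])
                c->valid[s] = 0;
}

/* Run on every cache hit before trusting the cached entry. */
void sync_cache(struct mr_cache *c, struct fake_kernel *k)
{
    uint64_t now = k->counter;
    if (now == c->seen)
        return;  /* fast path: nothing was invalidated */

    mr_cookie batch[BATCH];
    size_t n;
    do {  /* repeat until drained */
        n = fake_get_invalid_mrs(k, batch, BATCH);
        drop_from_cache(c, batch, n);
    } while (n == BATCH);
    c->seen = now;
}
```

The common case is a single compare of two integers, which is the whole
point of exporting the counter to userspace.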

Jason


