[ofa-general] Memory registration redux

Roland Dreier rdreier at cisco.com
Thu May 7 14:46:55 PDT 2009


 > > No... every HCA just needs to support register and unregister.  It
 > > doesn't have to support changing the mapping without full unregister and
 > > reregister.
 > 
 > Well, I would imagine this entire process to be an HCA-specific
 > operation, so HW that supports a better method can use it; otherwise
 > it has to register/unregister. Is this a concern today with existing
 > HCAs?
 > 
 > Using register/unregister exposes a race for the original case you
 > brought up - but that race is completely unfixable without hardware
 > support. At least it now becomes a hw specific race that can be
 > printk'd and someday fixed in new HW rather than an unsolvable API
 > problem..

We definitely don't want to duplicate all this logic in every hardware
device driver, so most of it needs to be generic.  Adding new low-level
driver methods to handle this does raise the cost of implementing it
all.  But if we start with a generic register/unregister fallback that
drivers can override for better performance, then I think we're in good
shape.
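
To make that concrete, here is a rough sketch of the shape I mean --
every name in it (the resync_range() method and both helpers) is
invented for illustration, not an existing driver interface:

    /*
     * Purely illustrative -- resync_range() and both helpers are
     * hypothetical names, not existing driver methods.
     */
    struct ib_resync_ops {
        /* Optional: HCAs that can change a mapping in place. */
        int (*resync_range)(struct ib_mr *mr, u64 start, u64 len);
    };

    /* Generic fallback: full unregister + re-register of the MR. */
    static int generic_resync_range(struct ib_mr *mr, u64 start, u64 len)
    {
        /*
         * There is a window here where the MR is not registered at
         * all -- this is the hw-specific race discussed above.
         */
        unregister_mr_pages(mr);        /* hypothetical helper */
        return register_mr_pages(mr);   /* hypothetical helper */
    }

    static int resync_range(struct ib_device *dev, struct ib_mr *mr,
                            u64 start, u64 len)
    {
        if (dev->resync_ops && dev->resync_ops->resync_range)
            return dev->resync_ops->resync_range(mr, start, len);
        return generic_resync_range(mr, start, len);
    }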

 > > Also this requires potentially walking the page tables of the entire
 > > process, checking to see if any mappings have changed.  We really want
 > > to keep the information that the MMU notifiers give us, namely which
 > > virtual address range is changing.
 > 
 > Walking the page tables of every registration in the process, not the
 > entire process.

Yes... but there are bugs in bugzilla about mthca being limited to only
8 GB of registration by default, or something like that, and about that
breaking Intel MPI in some cases.  So some MPI jobs want tens of GBs of
registered memory -- and at 4 KB pages, tens of GBs means millions of
page table entries to walk on every resync operation, which seems like
a big problem to me.

Which means that the MMU notifier has to walk the list of memory
registrations and mark any affected ones as dirty (possibly with a hint
about which pages were invalidated), as you suggest below.  Falling back
to the "check every registration" ultra-slow-path should, I think, never
happen.

 > I was thinking more along the lines of having the mmu notifiers put
 > affected registrations on a per-process (or PD?) dirty linked list,
 > with the link pointers as part of the registration structure. Set a
 > dirty flag in the registration too. An extra pointer per registration
 > and a minor incremental cost to the existing work the mmu notifier
 > would have to do.

Yes, makes sense.
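
Something like this, say (a minimal sketch -- every structure and
function name below is invented for illustration):

    /* Illustrative sketch only; all names here are invented. */
    struct user_mr {
        u64              start, length;
        struct list_head mr_entry;     /* per-process list of MRs */
        struct list_head dirty_entry;  /* per-process dirty list */
        bool             dirty;
        u64              dirty_start, dirty_end;  /* invalidation hint */
    };

    /* Called from the mmu notifier invalidate path for [start, end). */
    static void mark_dirty_mrs(struct mr_tracking *trk, u64 start, u64 end)
    {
        struct user_mr *mr;

        spin_lock(&trk->lock);
        list_for_each_entry(mr, &trk->mr_list, mr_entry) {
            if (mr->start >= end || mr->start + mr->length <= start)
                continue;           /* no overlap with this MR */
            if (!mr->dirty) {
                mr->dirty = true;
                mr->dirty_start = start;
                mr->dirty_end   = end;
                list_add_tail(&mr->dirty_entry, &trk->dirty_list);
            } else {
                /* Widen the hint to cover both invalidations. */
                mr->dirty_start = min(mr->dirty_start, start);
                mr->dirty_end   = max(mr->dirty_end, end);
            }
        }
        spin_unlock(&trk->lock);
    }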

 > >  > Only part I don't immediately see is how to trap creation of new VM
 > >  > (ie mmap), mmu notifiers seem focused on invalidating, ie munmap()..
 > > 
 > > Why do we care?  The initial faulting in of mappings occurs when an MR
 > > is created.
 > 
 > Well, exactly, that's the problem. If you can't trap mmap you cannot
 > do the initial faulting and mapping for a new object that is being
 > mapped into an existing MR.
 > 
 > Consider:
 > 
 >   void *a = mmap(0,PAGE_SIZE..);
 >   ibv_register();
 >   // [..]
 >   munmap(a);
 >   ibv_synchronize();
 > 
 >   // At this point we want the HCA mapping to point to oblivion
 > 
 >   mmap(a,PAGE_SIZE,MAP_FIXED);
 >   ibv_synchronize();
 > 
 >   // And now we want it to point to the new allocation
 > 
 > I use MAP_FIXED to illustrate the point, but Jeff has said the same
 > address re-use happens randomly in real apps.

This can be handled, I think, although at some cost.  Just have the
kernel keep track of which MMU sequence number actually invalidated each
MR, and return (via ibv_synchronize()) the MMU change sequence number
that userspace is in sync with.  In the example above, the first
synchronize after munmap() will fail to fix up the first registration,
since it points to an unmapped virtual address; it will therefore leave
that MR on the dirty list and report that sequence number as not yet
synced up.  The second synchronize will then see that MR still on the
dirty list and try again to find the pages.

Passing the sequence number back to userspace makes it possible for
userspace to know that it still has to call ibv_synchronize() again.
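
In terms of the example above, userspace would see something like this
(the ibv_synchronize() signature here -- taking a PD and returning the
sequence number the kernel has synced up to -- is of course made up):

    /* Illustrative: this ibv_synchronize() return value is made up. */
    uint64_t synced;

    munmap(a, PAGE_SIZE);
    synced = ibv_synchronize(pd);
    /*
     * The MR over 'a' points at an unmapped virtual address, so it
     * stays on the dirty list and 'synced' comes back older than the
     * current MMU change sequence number: call again later.
     */

    mmap(a, PAGE_SIZE, PROT_READ | PROT_WRITE,
         MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    synced = ibv_synchronize(pd);
    /*
     * Now the pages exist again, the MR gets fixed up and leaves the
     * dirty list, and 'synced' catches up to the current sequence
     * number: userspace knows it is done.
     */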

There is the possibility that a 1 GB MR will have just its last page
unmapped, and end up having hundreds of thousands of pages (a 1 GB MR
covers 262,144 4 KB pages) walked again and again in every synchronize
operation.

 > This method avoids the problem you noticed, but there is extra work to
 > fixup a registration that may never be used again. I strongly suspect
 > that in the majority of cases this extra work should be about on the
 > same order as userspace calling unregister on the MR.

Yes, and also it doesn't match the current MPI way of lazily
unregistering things and only garbage collecting the refcnt-0 cache
entries when a registration fails.  With this method, if userspace
unregisters something, it really is gone; and if it doesn't unregister
it, it really does use up space until userspace explicitly unregisters
it.  Not sure how MPI implementers feel about that.
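
For comparison, the lazy cache I mean does roughly this (a simplified
sketch, not any particular MPI's code; cache_lookup(),
cache_evict_unused() and cache_insert() are invented names):

    #include <infiniband/verbs.h>

    struct cache_entry {
        void          *addr;
        size_t         len;
        struct ibv_mr *mr;
        int            refcnt;   /* 0 == unused but still registered */
    };

    static struct ibv_mr *cached_reg_mr(struct ibv_pd *pd,
                                        void *addr, size_t len)
    {
        struct cache_entry *e = cache_lookup(addr, len);

        if (e) {
            e->refcnt++;          /* hit: reuse the registration */
            return e->mr;
        }

        struct ibv_mr *mr = ibv_reg_mr(pd, addr, len,
                                       IBV_ACCESS_LOCAL_WRITE);
        if (!mr) {
            /*
             * Registration failed (e.g. out of space): only now
             * garbage collect refcnt-0 entries, really unregistering
             * them, and retry once.
             */
            cache_evict_unused();
            mr = ibv_reg_mr(pd, addr, len, IBV_ACCESS_LOCAL_WRITE);
        }
        if (mr)
            cache_insert(addr, len, mr);
        return mr;
    }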

 > Or, ignore the overlapping problem, and use your original technique,
 > slightly modified:
 >  - Userspace registers a counter with the kernel. Kernel pins the
 >    page, sets up mmu notifiers and increments the counter when
 >    invalidates intersect with registrations
 >  - Kernel maintains a linked list of registrations that have been
 >    invalidated via mmu notifiers using the registration structure
 >    and a dirty bit
 >  - Userspace checks the counter at every cache hit, if different it
 >    calls into the kernel:
 >        MR_Cookie *mrs[100];
 >        int rc = ibv_get_invalid_mrs(mrs,100);
 >        invalidate_cache(mrs,rc);
 >        // Repeat until drained
 > 
 >    get_invalid_mrs traverses the linked list and returns an
 >    identifying value to userspace, which looks it up in the cache,
 >    calls unregister and removes it from the cache.
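
Spelling out the userspace side of that (shared_counter here is assumed
to be a kernel-updated counter mapped into the process; MR_Cookie,
ibv_get_invalid_mrs() and invalidate_cache() are from your sketch):

    /* Sketch of the userspace check; shared_counter is assumed to be
     * mapped from the kernel, the rest is from the scheme above. */
    static volatile uint64_t *shared_counter;
    static uint64_t last_seen;

    static void check_cache(void)
    {
        uint64_t seen = *shared_counter;

        if (seen == last_seen)
            return;               /* fast path: nothing invalidated */

        MR_Cookie *mrs[100];
        int rc;

        /* Drain the kernel's dirty list, dropping each MR from the
         * userspace cache and unregistering it. */
        do {
            rc = ibv_get_invalid_mrs(mrs, 100);
            invalidate_cache(mrs, rc);
        } while (rc == 100);

        /*
         * Store the counter value read *before* draining, so any
         * invalidation that races with the drain is picked up on the
         * next cache hit.
         */
        last_seen = seen;
    }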

What's the advantage of this?  I have to do the get_invalid_mrs() call a
bunch of times, rather than just reading which ones are invalid from the
cache directly?

 - R.


