[ofa-general] Memory registration redux

Wed May 6 13:08:31 PDT 2009

 > Roland and I chatted on the phone today; I think I now understand
 > Roland's counter-proposal (I clearly didn't before).  Let me try to
 > summarize:
 > 
 > 1. Add a new verb for "set this userspace flag to 1 if mr X ever
 > becomes invalid"
 > 2. Add a new verb for "no longer tell me if mr X ever becomes invalid"
 > (i.e., remove the effects of #1)
 > 3. Add run-time query indicating whether #1 works
 > 4. Add [optional] memory registration caching to libibverbs

Looking closer at how to actually implement this, I see that the MMU
notifiers (cf. <linux/mmu_notifier.h>) may be called with locks held, so
the kernel can't do a put_user() or the equivalent (which may fault and
sleep) from the notifier.  Therefore I think the interface we would
expose to userspace would be something more like mmap() on some special
file to get some kernel memory mapped into userspace, and then ioctl()
to register/unregister a "set this flag if address range X...Y is
affected" request.

To be honest I don't really love this idea -- the kernel still needs a
fairly complicated data structure to track the registered address ranges
efficiently, the size of the mmap() puts a static limit, set at
initialization time, on the number of ranges that can be tracked (or
else handling multiple maps makes things more complex still), and some
careful thinking is required to make sure there are no memory ordering
or cache aliasing issues.
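
To make the shape of that interface concrete, here is a rough userspace
simulation, not a real kernel API.  Every name in it (flag_page,
sim_kernel, sim_register and so on) is made up for illustration: the
flag_page struct stands in for the memory that would be mmap()ed from
the special file, sim_register()/sim_unregister() stand in for the
ioctl() calls, and sim_invalidate() plays the role of the MMU notifier.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define MAX_RANGES 64  /* static limit: one flag per slot in the mapping */

/* What userspace would see through mmap(): one flag per tracked range. */
struct flag_page {
    atomic_uint flag[MAX_RANGES];
};

/* Kernel-private bookkeeping (simulated): which range each slot covers. */
struct tracked_range {
    uintptr_t start, end;  /* half-open interval [start, end) */
    int in_use;
};

struct sim_kernel {
    struct flag_page *page;
    struct tracked_range range[MAX_RANGES];
};

void sim_init(struct sim_kernel *k, struct flag_page *page)
{
    k->page = page;
    for (int i = 0; i < MAX_RANGES; i++) {
        k->range[i].in_use = 0;
        atomic_init(&page->flag[i], 0);
    }
}

/* Stand-in for the "register" ioctl(): returns a slot index, -1 if full. */
int sim_register(struct sim_kernel *k, uintptr_t start, uintptr_t end)
{
    for (int i = 0; i < MAX_RANGES; i++) {
        if (!k->range[i].in_use) {
            k->range[i] = (struct tracked_range){ start, end, 1 };
            atomic_store(&k->page->flag[i], 0);
            return i;
        }
    }
    return -1;
}

/* Stand-in for the "unregister" ioctl(). */
void sim_unregister(struct sim_kernel *k, int slot)
{
    assert(slot >= 0 && slot < MAX_RANGES);
    k->range[slot].in_use = 0;
}

/*
 * Stand-in for the MMU notifier callback.  It only stores to memory the
 * "kernel" owns, so nothing here can fault or sleep.  The linear scan is
 * for brevity; it is exactly the part that would need a smarter data
 * structure to be efficient.
 */
void sim_invalidate(struct sim_kernel *k, uintptr_t start, uintptr_t end)
{
    for (int i = 0; i < MAX_RANGES; i++) {
        const struct tracked_range *r = &k->range[i];
        if (r->in_use && start < r->end && r->start < end)
            atomic_store_explicit(&k->page->flag[i], 1,
                                  memory_order_release);
    }
}

/* Userspace side: test a flag without entering the kernel. */
int range_invalid(struct flag_page *page, int slot)
{
    return atomic_load_explicit(&page->flag[slot],
                                memory_order_acquire) != 0;
}
```

Note how MAX_RANGES falls directly out of the size of the shared
mapping, which is the static limit complained about above.  The
release/acquire pair is only a guess at the ordering a real
implementation would need.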

So then I thought some about how to implement the full MR cache in the
kernel.  And that fairly quickly gets into some complex stuff as well.
For example, we can't take sleeping locks from MMU notifiers, but we
also can't hold non-sleeping locks across MR register operations, so we
need to drop our MR cache lock while registering things.  That forces us
to deal with rolling back a registration if we miss the cache initially
but then find that another thread has added a registration for the same
memory to the cache while we were still registering it.  Keeping the
actual MR caching in userspace does seem to make things simpler because
the locking is much easier without having to worry about sleeping
vs. non-sleeping locks.
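
The drop-the-lock-and-roll-back dance can be sketched as follows.  This
is a userspace illustration only: a pthread mutex stands in for the
non-sleeping cache lock, mr_slow_register()/mr_slow_deregister() are
hypothetical stand-ins for the real (possibly sleeping) operations, and
the while_unlocked hook exists purely so that a "competing thread" can
be simulated deterministically.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 8

struct mr_entry {
    uintptr_t addr;
    size_t len;
    int handle;  /* hypothetical MR handle */
    int valid;
};

struct mr_cache {
    pthread_mutex_t lock;    /* stands in for the non-sleeping cache lock */
    struct mr_entry slot[CACHE_SLOTS];
    atomic_int next_handle;  /* source of fake MR handles */
    atomic_int rollbacks;    /* registrations undone after a lost race */
    /* Demo hook, run while the lock is dropped, to play another thread. */
    void (*while_unlocked)(struct mr_cache *, uintptr_t, size_t);
};

void mr_cache_init(struct mr_cache *c)
{
    pthread_mutex_init(&c->lock, NULL);
    for (int i = 0; i < CACHE_SLOTS; i++)
        c->slot[i].valid = 0;
    atomic_init(&c->next_handle, 1);
    atomic_init(&c->rollbacks, 0);
    c->while_unlocked = NULL;
}

/* Stand-ins for the real register/deregister, which may sleep. */
static int mr_slow_register(struct mr_cache *c)
{
    return atomic_fetch_add(&c->next_handle, 1);
}

static void mr_slow_deregister(struct mr_cache *c, int handle)
{
    (void)handle;
    atomic_fetch_add(&c->rollbacks, 1);
}

static struct mr_entry *lookup_locked(struct mr_cache *c,
                                      uintptr_t addr, size_t len)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (c->slot[i].valid && c->slot[i].addr == addr &&
            c->slot[i].len == len)
            return &c->slot[i];
    return NULL;
}

int mr_cache_get(struct mr_cache *c, uintptr_t addr, size_t len)
{
    assert(len > 0);
    pthread_mutex_lock(&c->lock);
    struct mr_entry *e = lookup_locked(c, addr, len);
    if (e) {
        int hit = e->handle;
        pthread_mutex_unlock(&c->lock);
        return hit;
    }
    pthread_mutex_unlock(&c->lock);  /* miss: drop lock before registering */

    if (c->while_unlocked)
        c->while_unlocked(c, addr, len);
    int mine = mr_slow_register(c);

    pthread_mutex_lock(&c->lock);
    e = lookup_locked(c, addr, len);  /* did someone beat us to it? */
    int winner = e ? e->handle : mine;
    if (!e) {
        for (int i = 0; i < CACHE_SLOTS; i++) {
            if (!c->slot[i].valid) {
                c->slot[i] = (struct mr_entry){ addr, len, mine, 1 };
                break;  /* if the cache is full, "mine" stays uncached */
            }
        }
    }
    pthread_mutex_unlock(&c->lock);

    if (winner != mine)
        mr_slow_deregister(c, mine);  /* roll back outside the lock */
    return winner;
}

/* Plays the other thread: registers the same memory during the window. */
void competitor(struct mr_cache *c, uintptr_t addr, size_t len)
{
    c->while_unlocked = NULL;  /* run once, and avoid recursing */
    (void)mr_cache_get(c, addr, len);
}
```

Setting while_unlocked to competitor before a lookup that misses makes
the caller lose the race: it gets back the competitor's handle and its
own registration is rolled back.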

Also, doing the cache in userspace with my flag idea above has the nice
property that the fast path of hitting the cache on memory registration
needs no system call, and in fact testing the flag may even be a CPU
cache hit if memory registration is a hot enough path.  Doing it in the
kernel means even the best case has a system call -- which is very cheap
with current CPUs but still a non-zero cost.
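
As a rough picture of that fast path, here is a minimal single-threaded
sketch of a userspace cache lookup.  All names are hypothetical: the
stale[] array stands in for the flags the kernel would set in the shared
mapping, and umr_slow_register() stands in for the registration system
call, counted in slow_calls so the cost of each path is visible.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define UMR_SLOTS 8

struct umr_entry {
    uintptr_t addr;
    size_t len;
    int handle;  /* hypothetical MR handle */
    int in_use;
};

struct umr_cache {
    struct umr_entry entry[UMR_SLOTS];
    atomic_uint stale[UMR_SLOTS];  /* stand-in for kernel-set flags */
    int next_handle;
    int slow_calls;  /* simulated registration system calls */
};

void umr_init(struct umr_cache *c)
{
    for (int i = 0; i < UMR_SLOTS; i++) {
        c->entry[i].in_use = 0;
        atomic_init(&c->stale[i], 0);
    }
    c->next_handle = 1;
    c->slow_calls = 0;
}

/* Stand-in for the real registration, which costs a system call. */
static int umr_slow_register(struct umr_cache *c)
{
    c->slow_calls++;
    return c->next_handle++;
}

int umr_get(struct umr_cache *c, uintptr_t addr, size_t len)
{
    assert(len > 0);
    int slot = -1;
    for (int i = 0; i < UMR_SLOTS; i++) {
        struct umr_entry *e = &c->entry[i];
        if (e->in_use && e->addr == addr && e->len == len) {
            /* Fast path: one load from shared memory, no system call. */
            if (!atomic_load_explicit(&c->stale[i], memory_order_acquire))
                return e->handle;
            slot = i;  /* mapping changed: re-register into this slot */
            break;
        }
        if (!e->in_use && slot < 0)
            slot = i;  /* remember the first free slot */
    }
    if (slot < 0)
        return umr_slow_register(c);  /* cache full: hand back uncached */

    /*
     * Clear the flag before registering, so that an invalidation which
     * arrives during registration is seen by the next lookup.  A real
     * cache would also deregister the stale MR here.
     */
    atomic_store_explicit(&c->stale[slot], 0, memory_order_relaxed);
    c->entry[slot] = (struct umr_entry){ addr, len,
                                         umr_slow_register(c), 1 };
    return c->entry[slot].handle;
}
```

A repeated umr_get() on the same range leaves slow_calls unchanged until
something sets the matching stale flag, at which point the next lookup
pays for one new registration.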

So I'm really not sure what the right way to go is yet.  Further
opinions would be helpful.

 - R.