[ofa-general] New proposal for memory management

Roland Dreier rdreier at cisco.com
Fri May 1 10:09:47 PDT 2009


 > You mentioned that doing this stuff is a choice; the choice that
 > MPI's/ ULPs/applications therefore have is:
 >
 > - don't use registration caches/memory allocation hooking, have
 > terrible performance
 > - use registration caches/memory allocation hooking, have good
 > performance

I think it's a bit of a stretch to suggest that all or even most
userspace RDMA applications have the same need for registration caching
as MPI.  In fact my feeling is that MPI's situation -- having to do
RDMA to arbitrary memory that the application allocated outside MPI's
control -- is the exception.  My most recent experience was with Cisco's
RAB library, and in that case we simply designed the library so that all
RDMA was done to memory allocated by the library -- so there was no need
for a registration cache, and in fact no need for registration in any
fast path.  I suspect that the majority of code written to use RDMA
natively will be designed with similar properties.
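
To make the design concrete, here is a minimal sketch of what such a
library-owned buffer pool might look like.  The names (rab_pool_create
and so on) are hypothetical, not the real RAB API: the point is only
that registration would happen once at pool creation (where a real
verbs program would call ibv_reg_mr on the whole slab), so the fast
path is just a free-list pop with no registration at all.

```c
#include <assert.h>
#include <stdlib.h>

#define POOL_BUFS 16
#define BUF_SIZE  4096

struct rab_pool {
    char *slab;                  /* one big allocation, registered once */
    void *free_list[POOL_BUFS];
    int   nfree;
    /* struct ibv_mr *mr;          registration handle, in a real build */
};

struct rab_pool *rab_pool_create(void)
{
    struct rab_pool *p = malloc(sizeof(*p));
    if (!p)
        return NULL;
    p->slab = malloc((size_t) POOL_BUFS * BUF_SIZE);
    if (!p->slab) {
        free(p);
        return NULL;
    }
    /* In a real verbs program, register the whole slab exactly once:
       p->mr = ibv_reg_mr(pd, p->slab, POOL_BUFS * BUF_SIZE,
                          IBV_ACCESS_LOCAL_WRITE | ...); */
    for (int i = 0; i < POOL_BUFS; i++)
        p->free_list[i] = p->slab + (size_t) i * BUF_SIZE;
    p->nfree = POOL_BUFS;
    return p;
}

/* Fast path: no registration, no syscall -- just a free-list pop. */
void *rab_buf_get(struct rab_pool *p)
{
    return p->nfree ? p->free_list[--p->nfree] : NULL;
}

void rab_buf_put(struct rab_pool *p, void *buf)
{
    p->free_list[p->nfree++] = buf;
}
```

Because the library hands out the buffers itself, the application can
never present memory the library hasn't already registered, which is
why no cache or hooking is needed.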

So this proposal is very much an MPI-specific interface.  Which leads to
my next point.  I have no doubt that the MPI community has a very good
idea of a memory registration interface that would make MPI
implementations simpler and more robust.  However, I don't think there's
quite as much expertise there about the best way to implement such an
interface.

My initial reaction is that I don't want to extend the kernel ABI with
a set of new MPI-specific verbs if there's a way around it.  We've been
told over and over that the registration cache is complex and fragile
code -- but moving complex and fragile code into the kernel doesn't
magically make it any simpler or more robust, it just means that bugs
now crash the whole system instead of just affecting one process.

Now, of course MMU notifiers allow the kernel to know reliably when a
process's page tables change, which means that none of the complicated
malloc hooking etc. is needed.  So that complexity is avoided in the
kernel.  But suppose I give userspace the same MMU notifier capability
(e.g. I add a system call that says "if any mapping in the virtual
address range X ... Y changes, then write a 1 to virtual address Z") --
then what
do I gain from having the rest of the registration caching in the
kernel?  (And avoiding the duplication of caching code between multiple
MPI implementations is not an answer -- it's quite feasible to put the
caching code into libibverbs if that's the best place for it)
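
To sketch what the userspace side of that would look like: assume a
hypothetical system call has armed the kernel to set a flag word when
any mapping in a watched range changes (the "write a 1 to virtual
address Z" above).  The cache's fast path then only has to check that
flag -- no malloc hooking anywhere.  Everything here is illustrative;
reg_cache_lookup and the entry layout are made up, and a real cache
would get its lkey from ibv_reg_mr() and drop entries with
ibv_dereg_mr().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct reg_cache_entry {
    uintptr_t     start, len;
    uint32_t      lkey;          /* would come from ibv_reg_mr() */
    volatile int  invalidated;   /* the kernel-written "address Z" */
};

/* Returns the cached lkey covering [addr, addr+len), or 0 if the
   caller must (re)register. */
uint32_t reg_cache_lookup(struct reg_cache_entry *e,
                          uintptr_t addr, uintptr_t len)
{
    if (e->invalidated) {
        /* Mapping changed under us: drop the stale entry.  A real
           cache would ibv_dereg_mr() here before reusing the slot. */
        memset(e, 0, sizeof(*e));
        return 0;
    }
    if (addr >= e->start && addr + len <= e->start + e->len)
        return e->lkey;
    return 0;                    /* miss: range not covered */
}
```

This logic is exactly as at home in libibverbs as in the kernel, which
is the point: once the notification mechanism exists, the caching
itself doesn't need to live in kernel space.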

 - R.
