[ofa-general] New proposal for memory management

Jeff Squyres jsquyres at cisco.com
Mon Apr 13 09:07:17 PDT 2009


The following is a proposal from several MPI implementations to the  
OpenFabrics community (various MPI implementation representatives  
CC'ed).  The basic concept was introduced in the MPI Panel at Sonoma  
(see http://www.openfabrics.org/archives/spring2009sonoma/tuesday/panel3/panel3.zip);
it was further refined in discussions after Sonoma.

Introduction:
=============

MPI has long had a problem maintaining its own verbs memory
registration cache in userspace.  The main issue is that user
applications are responsible for allocating/freeing their own data
buffers -- the MPI layer does not (usually) have visibility when
application buffers are allocated or freed.  Hence, MPI has had to
intercept deallocation calls in order to know when its registration
cache entries have potentially become invalid.  Horrible and dangerous
tricks are used to intercept the various flavors of free, sbrk,
munmap, etc.

Here's the classic scenario we're trying to handle better:

1. MPI application allocs buffer A and MPI_SENDs it
2. MPI library registers buffer A and caches it (in user space)
3. MPI application frees buffer A
4. page containing buffer A is returned to the OS
5. MPI application allocs buffer B
    5a. B is at the same virtual address as A, but at a different
        physical address
6. MPI application MPI_SENDs buffer B
7. MPI library thinks B is already registered and sends it
    --> the physical address may well still be registered, so the send
        does not fail -- but it's the wrong data

Note that the above scenario occurs because, before Linux kernel
v2.6.27, the OF kernel drivers were not notified when pages were
returned to the OS -- we're leaking registered memory, and therefore
the OF driver/hardware have the wrong virtual/physical mapping.  It
*may* not segv at step 7 because the OF driver/hardware can still
access the memory and it is still registered.  But it will definitely
be accessing the wrong physical memory.
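
To make the scenario concrete, here is a minimal userspace sketch
(cache_insert()/cache_lookup() are hypothetical stand-ins for
whatever structure the MPI keeps internally):

   // steps 1-2: allocate, register, and cache buffer A
   void *a = malloc(LEN);
   struct ibv_mr *mr = ibv_reg_mr(pd, a, LEN, IBV_ACCESS_LOCAL_WRITE);
   cache_insert(mpi_reg_cache, a, LEN, mr);
   // steps 3-4: A is freed and its page may return to the OS, but
   // the MPI cache entry (and the HCA mapping) silently live on
   free(a);
   // step 5: B happens to land at the same virtual address as A
   void *b = malloc(LEN);
   // steps 6-7: cache hit on the virtual address, so MPI skips
   // re-registration and the HCA uses the old physical mapping
   mr = cache_lookup(mpi_reg_cache, b, LEN);   // "hit" because b == a
   // ... ibv_post_send() with mr->lkey sends the wrong data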

In discussions before the Sonoma OpenFabrics event this year, several
MPI implementations got together and concluded that userspace
"notifier" functions might solve this issue for MPI (as proposed by
Pete Wyckoff quite a while ago).  Specifically, when memory is
unregistered down in the kernel, a flag is set in userspace that
allows the userspace to know that it needs to make a [potentially
expensive] downcall to find out exactly what happened.  In this way,
MPI can know when to update its registration cache safely.

After further post-Sonoma discussion, it became evident that the
so-called userspace "notifier" functions may not solve the problem --
there seem to be unavoidable race conditions, particularly in
multi-threaded applications (more on this below).  We concluded that
what could be useful is to move the registration cache from the
userspace/MPI down into the kernel and maintain it on a per-protection
domain (PD) basis.

Short version:
==============

Here's a short version of our proposal:

1. A new enum value is added to ibv_access_flags: IBV_ACCESS_CACHE.
    If this flag is set in the call to ibv_reg_mr(), the following
    occurs down in the kernel:
    - look for the memory to be registered in the PD-specific cache
    - if found
        - increment its refcount
    - else
        - try to register the memory
        - if the registration fails because no more registered memory is available
            - traverse all PD registration caches in this process,
              evicting/unregistering each entry with a refcount <= 0
            - try to register the memory again
        - if the registration succeeds (either the 1st or the 2nd time),
          put it in the PD cache with a refcount of 1

    If this flag is *not* set in the call to ibv_reg_mr(), then the
    following occurs:

    - try to register the memory
    - if the registration fails because no more registered memory is
      available
        - traverse all PD registration caches in this process,
          evicting/unregistering each entry with a refcount <= 0
        - try to register the memory again

    If an application never uses IBV_ACCESS_CACHE, registration
    performance should be no different.  Registration costs may
    increase slightly in some cases if there is a non-empty
    registration cache.  (A sketch of this kernel-side caching logic
    appears below.)

2. The kernel side of the ibv_dereg_mr() deregistration call now does
    the following:
    - look for the memory to be deregistered in the PD's cache
    - if it's in the cache
        - decrement the refcount (leaving the memory registered)
    - else
        - unregister the memory

3. A new verb, ibv_is_reg(), is created to query if the entire buffer
    X is already registered.  If it is, increase its refcount in the
    reg cache.  If it is not, just return an error (and do not register
    any of the buffer).

    --> An alternate proposal for this idea is to add another
        ibv_access_flags value (e.g., IBV_ACCESS_IS_CACHED) instead of
        a new verb.  But that might be a little odd in that we don't
        want the memory registered if it's not already registered.

    This verb is useful for pipelined protocols to offset the cost of
    registration of long buffers (e.g., if the buffer is already
    registered, just send it -- otherwise let the ULP potentially do
    something else).  See below for a more detailed explanation / use
    case.

4. A new verb, ibv_reg_mr_limits(), is created to specify some
    configuration information about the registration cache.
    Configuration specifics TBD here, but one obvious possibility here
    would be to specify the maximum number of pages that can be
    registered by this process (which must be <= the value specified
    in limits.conf, or it will fail).

5. A new verb, ibv_reg_mr_clean(), is created to traverse the internal
    registration cache and actually de-register any item with a
    refcount <= 0.  The intent is to give applications the ability to
    forcibly deregister any still-existing memory that has been
    ibv_reg_mr(..., IBV_ACCESS_CACHE)'ed and later ibv_dereg_mr()'ed.
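
Taken together, the proposed additions might look something like the
following (hypothetical signatures and flag value -- sketched only
from the descriptions above, not from any agreed-upon header):

   /* new access flag; value chosen arbitrarily for illustration */
   enum ibv_access_flags {
       /* ... existing IBV_ACCESS_* flags ... */
       IBV_ACCESS_CACHE = (1 << 7)
   };

   /* query whether [addr, addr+len) is entirely registered in pd's
      cache; if so, bump its refcount -- never registers anything */
   int ibv_is_reg(struct ibv_pd *pd, void *addr, size_t len);

   /* configure per-PD cache limits (e.g., max registerable pages) */
   int ibv_reg_mr_limits(struct ibv_pd *pd, size_t max_num_pages);

   /* deregister every cache entry in pd with a refcount <= 0 */
   int ibv_reg_mr_clean(struct ibv_pd *pd);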

These proposals assume that the new MMU notifier system in >=2.6.27
kernels will be used to catch when memory is returned from a process
to the kernel, and will both unregister the memory and remove it from
the kernel PD reg caches, if relevant.
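
A sketch of the corresponding kernel-side logic (items 1 and 2 above)
might look roughly like this -- all structure and function names here
are made up for illustration:

   /* one cached registration in a PD's cache */
   struct pd_cache_entry {
       u64 addr, len;
       struct ib_mr *mr;
       int refcount;          /* <= 0 means "evictable" */
   };

   struct ib_mr *cached_reg_mr(struct ib_pd *pd, u64 addr, u64 len,
                               int access)
   {
       struct pd_cache_entry *e = pd_cache_find(pd, addr, len);
       if (e) {
           e->refcount++;     /* hit: no HCA work at all */
           return e->mr;
       }
       struct ib_mr *mr = hw_reg_mr(pd, addr, len, access);
       if (!mr) {
           /* out of registerable memory: evict refcount <= 0
              entries from *all* PD caches in this process, retry */
           evict_unused_entries_all_pds();
           mr = hw_reg_mr(pd, addr, len, access);
       }
       if (mr)
           pd_cache_insert(pd, addr, len, mr);   /* refcount = 1 */
       return mr;
   }

   void cached_dereg_mr(struct ib_pd *pd, u64 addr, u64 len)
   {
       struct pd_cache_entry *e = pd_cache_find(pd, addr, len);
       if (e)
           e->refcount--;     /* leave registered for reuse */
       else
           hw_dereg_mr(pd, addr, len);   /* never cached */
   }

An MMU notifier callback would additionally remove (and unregister)
any cached entry whose pages are being returned to the OS.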

More details:
=============

Starting with Linux kernel v2.6.27, the OF kernel drivers can be
notified when pages are returned to the OS (I don't know if they yet
take advantage of this feature).  However, we can still run into
pretty much the same scenario -- the MPI userspace registration cache
can become invalid even though the kernel is no longer leaking
registered memory.  The situation is *slightly* better in that
ibv_post_send() may now fail: in a single-threaded application, the
memory will likely have been unregistered by then.

Pete Wyckoff's solution several years ago was to add two steps to
the scenario listed above; my understanding is that this is now
possible with the MMU notifiers in 2.6.27 (new steps 4a and 4b):

1. MPI application allocs buffer A and MPI_SENDs it
2. MPI library registers buffer A and caches it (in user space)
3. MPI application frees buffer A
4. page containing buffer A is returned to the OS
    4a. OF kernel driver is notified and can unregister the page
    4b. OF kernel driver can twiddle a bit in userspace indicating that
        something has changed
...etc.

The thought here is that the MPI can register a global variable during
MPI_INIT that can be modified during step 4b.  Hence, you can add a
cheap "if" statement in MPI's send path like this:

   // set asynchronously by the kernel when some registration changed
   if (variable_has_changed_indicating_step_4b_executed) {
       // [potentially expensive] downcall: ask exactly what changed
       ibv_expensive_downcall_to_find_out_what_happened(..., &output);
       if (need_to_register(buffer, mpi_reg_cache, output)) {
           ibv_reg_mr(buffer, ...);
       }
   }
   ibv_post_send(...);

You get the idea -- check the global variable before invoking
ibv_post_send() or ibv_post_recv(), and if necessary, register the
memory that MPI thought was already registered.

But wacky situations might occur in a multithreaded application: one
thread calls free() while another thread calls malloc() and gets the
same virtual address that was just free()d but has not yet been
unregistered in the kernel.  A subsequent ibv_post_send() may then
succeed -- but be sending the wrong data.

Put simply: in a multi-threaded application, there's always the chance
that the notify won't get to the user-level process until after the
global notifier variable has been checked, right?  Or, putting it the
other way: is there any kind of notify system that could be used that
*can't* create a potential race condition in a multi-threaded user
application?

   NOTE: There's actually some debate about whether this "bad" scenario
         could actually happen -- I admit that I'm not entirely sure.
         But if this race condition *can* happen, then I cannot think
         of a kernel notifier system that would not have this race
         condition.
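
To make the suspected interleaving concrete, here is one possible
timeline (whether this can actually occur is exactly the open
question above):

   1. Thread T1: free(A); the page is queued to go back to the OS,
      but the kernel notifier has not yet run
   2. Thread T2: malloc() returns the same virtual address (B == A)
   3. Thread T2: checks the global notifier variable -- still clear
   4. Thread T2: ibv_post_send(B) using the stale cache entry
   5. Kernel: notifier finally runs, unregisters the page, and sets
      the notifier variable -- too late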

So a few of us hashed this around and came up with an alternate
proposal:

1. Move the entire registration cache down into the kernel.
    Supporting rationale:
    1a. If all ULPs (MPIs, in this case) have to implement registration
        caches, why not implement it *once*, not N times?
    1b. Putting the reg cache in the kernel means that with the MMU
        notifier system introduced in 2.6.27, the kernel can call back
        to the device driver when the mapping changes so that a) the
        memory can be deregistered, and b) the corresponding item can
        be removed from the registration cache.  Specifically: the race
        condition described above can be fixed because it's all located
        in one place in the kernel.

2. This means that the userspace process must *always* call
    ibv_reg_mr() and ibv_dereg_mr() to increment / decrement the
    reference counts on the kernel reg cache.  But in practice,
    on-demand registration/de-registration is only done for long
    messages (short messages typically use
    copy-to-pre-registered-buffers schemes).  So the additional
    ibv_reg_mr() before calling ibv_post_send() / ibv_post_recv() for
    long messages shouldn't matter.

3. The registration cache in the kernel can lazily deregister cached
    memory, as described in the "short version" discussion, above
    (quite similar to what MPIs do today).
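
Under this scheme, item 2's long-message send path might look like
the following sketch (work request / SGE setup and completion
handling elided):

   // register -- or, on a cache hit, just refcount-bump -- the buffer
   struct ibv_mr *mr = ibv_reg_mr(pd, buffer, len, IBV_ACCESS_CACHE);
   // ... build an SGE from mr->lkey and post the send ...
   ibv_post_send(qp, &wr, &bad_wr);
   // ... wait for the completion on the CQ ...
   // decrement the kernel cache refcount; the registration itself
   // may lazily stay put for next time
   ibv_dereg_mr(mr);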

To offset the cost of large memory registrations (registration cost
is linearly proportional to the size of the buffer being registered),
pipelined protocols are sometimes used.  As such, it seems useful to
have a "is this memory already registered?" verb -- a ULP can check to
see if an entire long message is already registered, and if so, do a
single large RDMA action.  If not, the ULP can use a pipelined
protocol to loop over registering a portion of the buffer and then
RDMA'ing it.

Possible pipelined pseudocode can look like this:

   if (ibv_is_reg(pd, buffer, len)) {
       // whole buffer already registered: do one large RDMA
       ibv_post_send(...);
       // will still need to ibv_dereg_mr() after completion
   } else {
       // pipeline loop: register, then send, one chunk at a time
       for (i = 0; i < len; i += pipeline_size) {
           chunk = min(pipeline_size, len - i);
           ibv_reg_mr(pd, (char *) buffer + i, chunk,
                      IBV_ACCESS_CACHE);
           ibv_post_send(...);
       }
   }

The rationale here is that these verbs allow the flexibility of doing
something like the above scenario or just registering the whole long
buffer and sending it immediately:

   ibv_reg_mr(pd, buffer, len, IBV_ACCESS_CACHE);
   ibv_post_send(...);

It may also be useful to programmatically enforce some limits on a
given PD's registration cache.  A per-process limit is already
enforced via /etc/security/limits.conf, but it may be useful to
specify per-PD limits in the ULP (MPI) itself.  Note that most MPIs
have controls like this already; it's consistent with moving the
registration cache down to the kernel.  A proposal for the verb (as
named in item 4 above) could be:

   ibv_reg_mr_limits(pd, max_num_pages)

Another userspace-accessible verb that may be useful is one that
traverses a PD's reg cache and actually deregisters any item with a
refcount <= 0.  This allows a ULP to "clean out" any lingering
registrations, thereby freeing up registered memory for other uses
(e.g., being registered by another PD).  This verb can have a
simplistic interface:

   ibv_reg_mr_clean(pd)

It's not 100% clear that we need this "clean" verb -- if ibv_reg_mr()
will evict entries with <= 0 refcounts from any PD's registration
cache in this process, that might be enough.  However, mixing
verbs-registered memory with other (non-verbs) pinned memory in the
same process may make this verb necessary.

-----

Finally, it should be noted that with 2.6.27's MMU notifier system,
full on-demand paging / registering seems possible.  On-demand paging
would be a full, complete solution -- the ULP wouldn't have to worry
about registering / de-registering memory at all (the existing
de/registration verbs could become no-ops for backwards
compatibility).  I assume that a proposal along these lines would be
a [much] larger debate in the OpenFabrics community, and further
assume that the proposal above would be a smaller debate and actually
have a chance of being implemented in the not-distant future.

(/me puts on fire suit)

Thoughts?

-- 
Jeff Squyres
Cisco Systems



