[ofa-general] Re: New proposal for memory management
Jeff Squyres
jsquyres at cisco.com
Tue Apr 28 14:31:41 PDT 2009
Is anyone going to comment on this? I'm surprised / disappointed that
it's been over 2 weeks with *no* comments.
Roland can't lead *every* discussion...
On Apr 13, 2009, at 12:07 PM, Jeff Squyres wrote:
> The following is a proposal from several MPI implementations to the
> OpenFabrics community (various MPI implementation representatives
> CC'ed). The basic concept was introduced in the MPI Panel at Sonoma
> (see http://www.openfabrics.org/archives/spring2009sonoma/tuesday/panel3/panel3.zip)
> ; it was further refined in discussions after Sonoma.
>
> Introduction:
> =============
>
> MPI has long had a problem maintaining its own verbs memory
> registration cache in userspace. The main issue is that user
> applications are responsible for allocating/freeing their own data
> buffers -- the MPI layer does not (usually) have visibility when
> application buffers are allocated or freed. Hence, MPI has had to
> intercept deallocation calls in order to know when its registration
> cache entries have potentially become invalid. Horrible and dangerous
> tricks are used to intercept the various flavors of free, sbrk,
> munmap, etc.
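>
> As one illustration (a sketch of the general technique, not any
> particular MPI's implementation), interception is often done via the
> glibc malloc hooks; mpi_reg_cache_invalidate() is a hypothetical
> helper:
>
> #include <malloc.h>
>
> static void (*old_free_hook)(void *, const void *);
> static void mpi_free_hook(void *ptr, const void *caller);
>
> static void install_free_hook(void)
> {
>     old_free_hook = __free_hook;
>     __free_hook = mpi_free_hook;
> }
>
> static void mpi_free_hook(void *ptr, const void *caller)
> {
>     __free_hook = old_free_hook;    /* restore to avoid recursion */
>     mpi_reg_cache_invalidate(ptr);  /* hypothetical: evict any cache
>                                        entries covering ptr */
>     free(ptr);                      /* perform the real free */
>     __free_hook = mpi_free_hook;    /* re-arm the hook */
> }
>
> (And even that misses sbrk and munmap, which is part of the problem.)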
>
> Here's the classic scenario we're trying to handle better:
>
> 1. MPI application allocs buffer A and MPI_SENDs it
> 2. MPI library registers buffer A and caches it (in user space)
> 3. MPI application frees buffer A
> 4. page containing buffer A is returned to the OS
> 5. MPI application allocs buffer B
> 5a. B is at the same virtual address as A, but different physical
> address
> 6. MPI application MPI_SENDs buffer B
> 7. MPI library thinks B is already registered and sends it
> --> the physical address may well still be registered, so the send
> does not fail -- but it's the wrong data
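>
> In code, the failing send path looks roughly like this (a sketch;
> cache_find() / cache_insert() are hypothetical helpers for a
> userspace cache keyed on virtual address):
>
> struct ibv_mr *get_mr(struct ibv_pd *pd, void *buf, size_t len)
> {
>     /* step 7: buffer B hits buffer A's stale entry because B has
>        the same virtual address */
>     struct ibv_mr *mr = cache_find(buf, len);
>     if (mr == NULL) {
>         mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
>         cache_insert(buf, len, mr);
>     }
>     /* on a stale hit, the lkey still maps A's old physical pages */
>     return mr;
> }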
>
> Note that the above scenario occurs because before Linux kernel
> v2.6.27, the OF kernel drivers are not notified when pages are
> returned to the OS -- we're leaking registered memory, and therefore
> the OF driver/hardware have the wrong virtual/physical mapping. It
> *may* not segv at step 7 because the OF driver/hardware can still
> access the memory and it is still registered. But it will definitely
> be accessing the wrong physical memory.
>
> In discussions before the Sonoma OpenFabrics event this year, several
> MPI implementations got together and concluded that userspace
> "notifier" functions might solve this issue for MPI (as proposed by
> Pete Wyckoff quite a while ago). Specifically, when memory is
> unregistered down in the kernel, a flag is set in userspace,
> letting userspace know that it needs to make a [potentially
> expensive] downcall to find out exactly what happened. In this way,
> MPI can know when to update its registration cache safely.
>
> After further post-Sonoma discussion, it became evident that the
> so-called userspace "notifier" functions may not solve the problem --
> there seem to be unavoidable race conditions, particularly in
> multi-threaded applications (more on this below). We concluded that
> what could be useful is to move the registration cache from the
> userspace/MPI down into the kernel and maintain it on a per-protection
> domain (PD) basis.
>
> Short version:
> ==============
>
> Here's a short version of our proposal:
>
> 1. A new enum value is added to ibv_access_flags: IBV_ACCESS_CACHE.
> If this flag is set in the call to ibv_reg_mr(), the following
> occurs down in the kernel:
> - look for the memory to be registered in the PD-specific cache
> - if found
> - increment its refcount
> - else
> - try to register the memory
> - if the registration fails because no more memory is available
> - traverse all PD registration caches in this process,
> evicting/unregistering each entry with a refcount <= 0
> - try to register the memory again
> - if the registration succeeds (either the 1st or the 2nd time),
> put it in the PD cache with a refcount of 1
>
> If this flag is *not* set in the call to ibv_reg_mr(), then the
> following occurs:
>
> - try to register the memory
> - if the registration fails because no more registered memory is
> available
> - traverse all PD registration caches in this process,
> evicting/unregistering each entry with a refcount <= 0
> - try to register the memory again
>
> If an application never uses IBV_ACCESS_CACHE, registration
> performance should be no different. Registration costs may
> increase slightly in some cases if there is a non-empty
> registration cache.
>
> 2. The kernel side of the ibv_dereg_mr() deregistration call now does
> the following:
> - look for the memory to be deregistered in the PD's cache
> - if it's in the cache
> - decrement the refcount (leaving the memory registered)
> - else
> - unregister the memory
>
> 3. A new verb, ibv_is_reg(), is created to query if the entire buffer
> X is already registered. If it is, increase its refcount in the
> reg cache. If it is not, just return an error (and do not register
> any of the buffer).
>
> --> An alternate proposal for this idea is to add another
> ibv_access_flags value (e.g., IBV_ACCESS_IS_CACHED) instead of
> a new verb. But that might be a little odd in that we don't
> want the memory registered if it's not already registered.
>
> This verb is useful for pipelined protocols to offset the cost of
> registration of long buffers (e.g., if the buffer is already
> registered, just send it -- otherwise let the ULP potentially do
> something else). See below for a more detailed explanation / use
> case.
>
> 4. A new verb, ibv_reg_mr_limits(), is created to specify some
> configuration information about the registration cache.
> Configuration specifics TBD here, but one obvious possibility here
> would be to specify the maximum number of pages that can be
> registered by this process (which must be <= the value specified in
> limits.conf, or it will fail).
>
> 5. A new verb, ibv_reg_mr_clean(), is created to traverse the internal
> registration cache and actually de-register any item with a
> refcount <= 0. The intent is to give applications the ability to
> forcibly deregister any still-existing memory that has been
> ibv_reg_mr(..., IBV_ACCESS_CACHE)'ed and later ibv_dereg_mr()'ed.
>
> These proposals assume that the new MMU notifier system in >=2.6.27
> kernels will be used to catch when memory is returned from a process
> to the kernel, and will both unregister the memory and remove it from
> the kernel PD reg caches, if relevant.
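>
> Putting items 1-3 together, the kernel-side refcount semantics could
> look roughly like this (pseudocode; pd_cache_*(), hw_*(), and
> evict_unreferenced() are hypothetical helpers, not existing kernel
> APIs):
>
> /* ibv_reg_mr(..., IBV_ACCESS_CACHE) */
> mr = pd_cache_find(pd, addr, len);
> if (mr) {
>     mr->refcount++;                   /* hit: skip HW registration */
> } else {
>     mr = hw_register(pd, addr, len);
>     if (!mr) {                        /* out of registerable memory */
>         evict_unreferenced(process);  /* drop refcount <= 0 entries
>                                          from every PD cache */
>         mr = hw_register(pd, addr, len);
>     }
>     if (mr)
>         pd_cache_insert(pd, mr, 1);   /* refcount starts at 1 */
> }
>
> /* ibv_dereg_mr() */
> if (pd_cache_contains(pd, mr))
>     mr->refcount--;                   /* stays registered (lazy) */
> else
>     hw_unregister(mr);
>
> /* ibv_is_reg() -- item 3 */
> mr = pd_cache_find(pd, addr, len);
> if (mr)
>     mr->refcount++;                   /* found: take a reference */
> else
>     return error;                     /* do NOT register anything */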
>
> More details:
> =============
>
> Starting with Linux kernel v2.6.27, the OF kernel drivers can be
> notified when pages are returned to the OS (I don't know if they yet
> take advantage of this feature). However, we can still run into
> pretty much the same scenario -- the MPI userspace registration cache
> can become invalid even though the kernel is no longer leaking
> registered memory. The situation is *slightly* better: the
> ibv_post_send() may fail, because in a single-threaded application
> the memory will likely have been unregistered by then.
>
> Pete Wyckoff's solution several years ago was to add two steps to
> the scenario listed above; my understanding is that this is now
> possible with the MMU notifiers in 2.6.27 (new steps 4a and 4b):
>
> 1. MPI application allocs buffer A and MPI_SENDs it
> 2. MPI library registers buffer A and caches it (in user space)
> 3. MPI application frees buffer A
> 4. page containing buffer A is returned to the OS
> 4a. OF kernel driver is notified and can unregister the page
> 4b. OF kernel driver can twiddle a bit in userspace indicating that
> something has changed
> ...etc.
>
> The thought here is that the MPI can register a global variable during
> MPI_INIT that can be modified during step 4b. Hence, you can add a
> cheap "if" statement in MPI's send path like this:
>
> if (variable_has_changed_indicating_step_4b_executed) {
>     /* expensive downcall: ask the kernel what actually happened */
>     ibv_expensive_downcall_to_find_out_what_happened(..., &output);
>     /* re-register only if our cached entry was invalidated */
>     if (need_to_register(buffer, mpi_reg_cache, output)) {
>         ibv_reg_mr(buffer, ...);
>     }
> }
> ibv_post_send(...);
>
> You get the idea -- check the global variable before invoking
> ibv_post_send() or ibv_post_recv(), and if necessary, register the
> memory that MPI thought was already registered.
>
> But wacky situations might occur in a multithreaded application:
> one thread calls free() while another thread calls malloc() and gets
> the same virtual address that was just free()d but has not yet been
> unregistered in the kernel. A subsequent ibv_post_send() may then
> succeed but send the wrong data.
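>
> Concretely, the suspected interleaving is (illustrative ordering,
> not a demonstrated failure):
>
> /* Thread 1: free(A)          -> pages queued for return to the OS
>  * Thread 2: B = malloc(...)  -> same virtual address as A
>  * Thread 2: checks the flag  -> notifier has not fired yet, so the
>  *                               flag is still clear
>  * Thread 2: ibv_post_send(B) -> hits A's stale cache entry
>  * Kernel:   notifier fires   -> too late; wrong data already sent */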
>
> Put simply: in a multi-threaded application, there's always the chance
> that the notify won't get to the user-level process until after the
> global notifier variable has been checked, right? Or, putting it the
> other way: is there any kind of notify system that could be used that
> *can't* create a potential race condition in a multi-threaded user
> application?
>
> NOTE: There's actually some debate about whether this "bad" scenario
> could actually happen -- I admit that I'm not entirely sure.
> But if this race condition *can* happen, then I cannot think
> of a kernel notifier system that would not have this race
> condition.
>
> So a few of us hashed this around and came up with an alternate
> proposal:
>
> 1. Move the entire registration cache down into the kernel.
> Supporting rationale:
> 1a. If all ULPs (MPIs, in this case) have to implement registration
> caches, why not implement it *once*, not N times?
> 1b. Putting the reg cache in the kernel means that with the MMU
> notifier system introduced in 2.6.27, the kernel can call back
> to the device driver when the mapping changes so that a) the
> memory can be deregistered, and b) the corresponding item can
> be removed from the registration cache. Specifically: the race
> condition described above can be fixed because it's all located
> in one place in the kernel.
>
> 2. This means that the userspace process must *always* call
> ibv_reg_mr() and ibv_dereg_mr() to increment / decrement the
> reference counts on the kernel reg cache. But in practice,
> on-demand registration/de-registration is only done for long
> messages (short messages typically use
> copy-to-pre-registered-buffers schemes). So the additional
> ibv_reg_mr() before calling ibv_post_send() / ibv_post_recv() for
> long messages shouldn't matter.
>
> 3. The registration cache in the kernel can lazily deregister cached
> memory, as described in the "short version" discussion, above
> (quite similar to what MPIs do today).
>
> To offset the cost of large memory registrations (registration cost
> grows linearly with the size of the buffer being registered),
> pipelined protocols are sometimes used. As such, it seems useful to
> have a "is this memory already registered?" verb -- a ULP can check to
> see if an entire long message is already registered, and if so, do a
> single large RDMA action. If not, the ULP can use a pipelined
> protocol to loop over registering a portion of the buffer and then
> RDMA'ing it.
>
> Possible pipelined pseudocode can look like this:
>
> if (ibv_is_reg(pd, buffer, len)) {
>     // whole buffer already registered: do one large RDMA
>     ibv_post_send(...);
>     // will still need to ibv_dereg_mr() after completion
> } else {
>     // pipeline loop: register and send one chunk at a time
>     // (remainder handling omitted)
>     for (i = 0; i < len / pipeline_size; ++i) {
>         ibv_reg_mr(pd, buffer + i * pipeline_size,
>                    pipeline_size, IBV_ACCESS_CACHE);
>         ibv_post_send(...);
>     }
> }
>
> The rationale here is that these verbs allow the flexibility of doing
> something like the above scenario or just registering the whole long
> buffer and sending it immediately:
>
> ibv_reg_mr(pd, buffer, len, IBV_ACCESS_CACHE);
> ibv_post_send(...);
>
> It may also be useful to programmatically enforce some limits on a given
> PD's registration cache. A per-process limit is already enforced via
> /etc/security/limits.conf, but it may be useful to specify per-PD
> limits in the ULP (MPI) itself. Note that most MPIs have controls
> like this already; it's consistent with moving the registration cache
> down to the kernel. A proposal for the verb could be:
>
> ibv_reg_mr_cache_limits(pd, max_num_pages)
>
> Another userspace-accessible verb that may be useful is one that
> traverses a PD's reg cache and actually deregisters any item with a
> refcount <= 0. This allows a ULP to "clean out" any lingering
> registrations, thereby freeing up registered memory for other uses
> (e.g., being registered by another PD). This verb can have a
> simplistic interface:
>
> ibv_reg_mr_clean(pd)
>
> It's not 100% clear that we need this "clean" verb -- if ibv_reg_mr()
> will evict entries with <= 0 refcounts from any PD's registration
> cache in this process, that might be enough. However, mixing
> verbs-registered memory with other (non-verbs) pinned memory in the
> same process may make this verb necessary.
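>
> A hypothetical usage sketch (other_library_pin() stands in for any
> non-verbs consumer of pinned memory):
>
> ibv_dereg_mr(mr);        /* refcount drops to 0; memory stays
>                             registered (lazy deregistration) */
> ibv_reg_mr_clean(pd);    /* now actually unregistered */
> other_library_pin(buf);  /* pinning by the non-verbs user succeeds */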
>
> -----
>
> Finally, it should be noted that with 2.6.27's MMU notifier system,
> full on-demand paging / registering seems possible. On-demand paging
> would be a full, complete solution -- the ULP wouldn't have to worry
> about registering / de-registering memory at all (the existing
> de/registration verbs could become no-ops for backwards
> compatibility). I assume that a proposal along these lines would
> be a [much] larger debate in the OpenFabrics community, and further
> assume that the proposal above would be a smaller debate and actually
> have a chance of being implemented in the not-distant future.
>
> (/me puts on fire suit)
>
> Thoughts?
>
> --
> Jeff Squyres
> Cisco Systems
>
--
Jeff Squyres
Cisco Systems