[ofa-general] Re: New proposal for memory management
Jeff Squyres
jsquyres at cisco.com
Tue Apr 28 14:31:41 PDT 2009
Is anyone going to comment on this? I'm surprised / disappointed that
it's been over 2 weeks with *no* comments.
Roland can't lead *every* discussion...
On Apr 13, 2009, at 12:07 PM, Jeff Squyres wrote:
> The following is a proposal from several MPI implementations to the
> OpenFabrics community (various MPI implementation representatives
> CC'ed). The basic concept was introduced in the MPI Panel at Sonoma
> (see http://www.openfabrics.org/archives/spring2009sonoma/tuesday/panel3/panel3.zip)
> ; it was further refined in discussions after Sonoma.
>
> Introduction:
> =============
>
> MPI has long had a problem maintaining its own verbs memory
> registration cache in userspace. The main issue is that user
> applications are responsible for allocating/freeing their own data
> buffers -- the MPI layer does not (usually) have visibility when
> application buffers are allocated or freed. Hence, MPI has had to
> intercept deallocation calls in order to know when its registration
> cache entries have potentially become invalid. Horrible and dangerous
> tricks are used to intercept the various flavors of free, sbrk,
> munmap, etc.
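>
> As one illustration (a sketch of the general technique, not any
> particular MPI's implementation), interception is often done via the
> glibc malloc hooks; mpi_reg_cache_invalidate() is a hypothetical
> helper:
>
> #include <malloc.h>
>
> static void (*old_free_hook)(void *, const void *);
> static void mpi_free_hook(void *ptr, const void *caller);
>
> static void install_free_hook(void)
> {
>     old_free_hook = __free_hook;
>     __free_hook = mpi_free_hook;
> }
>
> static void mpi_free_hook(void *ptr, const void *caller)
> {
>     __free_hook = old_free_hook;    /* restore to avoid recursion */
>     mpi_reg_cache_invalidate(ptr);  /* hypothetical: evict any cache
>                                        entries covering ptr */
>     free(ptr);                      /* perform the real free */
>     __free_hook = mpi_free_hook;    /* re-arm the hook */
> }
>
> (And even that misses sbrk and munmap, which is part of the problem.)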
>
> Here's the classic scenario we're trying to handle better:
>
> 1. MPI application allocs buffer A and MPI_SENDs it
> 2. MPI library registers buffer A and caches it (in user space)
> 3. MPI application frees buffer A
> 4. page containing buffer A is returned to the OS
> 5. MPI application allocs buffer B
> 5a. B is at the same virtual address as A, but different physical
> address
> 6. MPI application MPI_SENDs buffer B
> 7. MPI library thinks B is already registered and sends it
> --> the physical address may well still be registered, so the send
> does not fail -- but it's the wrong data
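>
> In code, the failing send path looks roughly like this (a sketch;
> cache_find() / cache_insert() are hypothetical helpers for a
> userspace cache keyed on virtual address):
>
> struct ibv_mr *get_mr(struct ibv_pd *pd, void *buf, size_t len)
> {
>     /* step 7: buffer B hits buffer A's stale entry because B has
>        the same virtual address */
>     struct ibv_mr *mr = cache_find(buf, len);
>     if (mr == NULL) {
>         mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
>         cache_insert(buf, len, mr);
>     }
>     /* on a stale hit, the lkey still maps A's old physical pages */
>     return mr;
> }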
>
> Note that the above scenario occurs because before Linux kernel
> v2.6.27, the OF kernel drivers are not notified when pages are
> returned to the OS -- we're leaking registered memory, and therefore
> the OF driver/hardware have the wrong virtual/physical mapping. It
> *may* not segv at step 7 because the OF driver/hardware can still
> access the memory and it is still registered. But it will definitely
> be accessing the wrong physical memory.
>
> In discussions before the Sonoma OpenFabrics event this year, several
> MPI implementations got together and concluded that userspace
> "notifier" functions might solve this issue for MPI (as proposed by
> Pete Wyckoff quite a while ago). Specifically, when memory is
> unregistered down in the kernel, a flag is set in userspace,
> letting userspace know that it needs to make a [potentially
> expensive] downcall to find out exactly what happened. In this way,
> MPI can know when to update its registration cache safely.
>
> After further post-Sonoma discussion, it became evident that the
> so-called userspace "notifier" functions may not solve the problem --
> there seem to be unavoidable race conditions, particularly in
> multi-threaded applications (more on this below). We concluded that
> what could be useful is to move the registration cache from the
> userspace/MPI down into the kernel and maintain it on a per-protection
> domain (PD) basis.
>
> Short version:
> ==============
>
> Here's a short version of our proposal:
>
> 1. A new enum value is added to ibv_access_flags: IBV_ACCESS_CACHE.
> If this flag is set in the call to ibv_reg_mr(), the following
> occurs down in the kernel:
> - look for the memory to be registered in the PD-specific cache
> - if found
> - increment its refcount
> - else
> - try to register the memory
> - if the registration fails because no more memory is available
> - traverse all PD registration caches in this process,
> evicting/unregistering each entry with a refcount <= 0
> - try to register the memory again
> - if the registration succeeds (either the 1st or the 2nd time),
> put it in the PD cache with a refcount of 1
>
> If this flag is *not* set in the call to ibv_reg_mr(), then the
> following occurs:
>
> - try to register the memory
> - if the registration fails because no more registered memory is
> available
> - traverse all PD registration caches in this process,
> evicting/unregistering each entry with a refcount <= 0
> - try to register the memory again
>
> If an application never uses IBV_ACCESS_CACHE, registration
> performance should be no different. Registration costs may
> increase slightly in some cases if there is a non-empty
> registration cache.
>
> 2. The kernel side of the ibv_dereg_mr() deregistration call now does
> the following:
> - look for the memory to be deregistered in the PD's cache
> - if it's in the cache
> - decrement the refcount (leaving the memory registered)
> - else
> - unregister the memory
>
> 3. A new verb, ibv_is_reg(), is created to query if the entire buffer
> X is already registered. If it is, increase its refcount in the
> reg cache. If it is not, just return an error (and do not register
> any of the buffer).
>
> --> An alternate proposal for this idea is to add another
> ibv_access_flags value (e.g., IBV_ACCESS_IS_CACHED) instead of
> a new verb. But that might be a little odd in that we don't
> want the memory registered if it's not already registered.
>
> This verb is useful for pipelined protocols to offset the cost of
> registration of long buffers (e.g., if the buffer is already
> registered, just send it -- otherwise let the ULP potentially do
> something else). See below for a more detailed explanation / use
> case.
>
> 4. A new verb, ibv_reg_mr_limits(), is created to specify some
> configuration information about the registration cache.
> Configuration specifics TBD here, but one obvious possibility here
> would be to specify the maximum number of pages that can be
> registered by this process (which must be <= the value specified in
> limits.conf, or it will fail).
>
> 5. A new verb, ibv_reg_mr_clean(), is created to traverse the internal
> registration cache and actually de-register any item with a
> refcount <= 0. The intent is to give applications the ability to
> forcibly deregister any still-existing memory that has been
> ibv_reg_mr(..., IBV_ACCESS_CACHE)'ed and later ibv_dereg_mr()'ed.
>
> These proposals assume that the new MMU notifier system in >=2.6.27
> kernels will be used to catch when memory is returned from a process
> to the kernel, and will both unregister the memory and remove it from
> the kernel PD reg caches, if relevant.
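>
> Putting items 1-3 together, the kernel-side refcount semantics could
> look roughly like this (pseudocode; pd_cache_*(), hw_*(), and
> evict_unreferenced() are hypothetical helpers, not existing kernel
> APIs):
>
> /* ibv_reg_mr(..., IBV_ACCESS_CACHE) */
> mr = pd_cache_find(pd, addr, len);
> if (mr) {
>     mr->refcount++;                   /* hit: skip HW registration */
> } else {
>     mr = hw_register(pd, addr, len);
>     if (!mr) {                        /* out of registerable memory */
>         evict_unreferenced(process);  /* drop refcount <= 0 entries
>                                          from every PD cache */
>         mr = hw_register(pd, addr, len);
>     }
>     if (mr)
>         pd_cache_insert(pd, mr, 1);   /* refcount starts at 1 */
> }
>
> /* ibv_dereg_mr() */
> if (pd_cache_contains(pd, mr))
>     mr->refcount--;                   /* stays registered (lazy) */
> else
>     hw_unregister(mr);
>
> /* ibv_is_reg() -- item 3 */
> mr = pd_cache_find(pd, addr, len);
> if (mr)
>     mr->refcount++;                   /* found: take a reference */
> else
>     return error;                     /* do NOT register anything */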
>
> More details:
> =============
>
> Starting with Linux kernel v2.6.27, the OF kernel drivers can be
> notified when pages are returned to the OS (I don't know if they yet
> take advantage of this feature). However, we can still run into
> pretty much the same scenario -- the MPI userspace registration cache
> can become invalid even though the kernel is no longer leaking
> registered memory. The situation is *slightly* better: the
> ibv_post_send() may fail, because in a single-threaded application
> the memory will likely have been unregistered by then.
>
> Pete Wyckoff's solution several years ago was to add two steps to
> the scenario listed above; my understanding is that this is now
> possible with the MMU notifiers in 2.6.27 (new steps 4a and 4b):
>
> 1. MPI application allocs buffer A and MPI_SENDs it
> 2. MPI library registers buffer A and caches it (in user space)
> 3. MPI application frees buffer A
> 4. page containing buffer A is returned to the OS
> 4a. OF kernel driver is notified and can unregister the page
> 4b. OF kernel driver can twiddle a bit in userspace indicating that
> something has changed
> ...etc.
>
> The thought here is that the MPI can register a global variable during
> MPI_INIT that can be modified during step 4b. Hence, you can add a
> cheap "if" statement in MPI's send path like this:
>
> if (variable_has_changed_indicating_step_4b_executed) {
>     /* expensive downcall: ask the kernel what actually happened */
>     ibv_expensive_downcall_to_find_out_what_happened(..., &output);
>     /* re-register only if our cached entry was invalidated */
>     if (need_to_register(buffer, mpi_reg_cache, output)) {
>         ibv_reg_mr(buffer, ...);
>     }
> }
> ibv_post_send(...);
>
> You get the idea -- check the global variable before invoking
> ibv_post_send() or ibv_post_recv(), and if necessary, register the
> memory that MPI thought was already registered.
>
> But wacky situations might occur in a multithreaded application:
> one thread calls free() while another thread calls malloc() and gets
> the same virtual address that was just free()d but has not yet been
> unregistered in the kernel. A subsequent ibv_post_send() may then
> succeed but send the wrong data.
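>
> Concretely, the suspected interleaving is (illustrative ordering,
> not a demonstrated failure):
>
> /* Thread 1: free(A)          -> pages queued for return to the OS
>  * Thread 2: B = malloc(...)  -> same virtual address as A
>  * Thread 2: checks the flag  -> notifier has not fired yet, so the
>  *                               flag is still clear
>  * Thread 2: ibv_post_send(B) -> hits A's stale cache entry
>  * Kernel:   notifier fires   -> too late; wrong data already sent */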
>
> Put simply: in a multi-threaded application, there's always the chance
> that the notify won't get to the user-level process until after the
> global notifier variable has been checked, right? Or, putting it the
> other way: is there any kind of notify system that could be used that
> *can't* create a potential race condition in a multi-threaded user
> application?
>
> NOTE: There's actually some debate about whether this "bad" scenario
> could actually happen -- I admit that I'm not entirely sure.
> But if this race condition *can* happen, then I cannot think
> of a kernel notifier system that would not have this race
> condition.
>
> So a few of us hashed this around and came up with an alternate
> proposal:
>
> 1. Move the entire registration cache down into the kernel.
> Supporting rationale:
> 1a. If all ULPs (MPIs, in this case) have to implement registration
> caches, why not implement it *once*, not N times?
> 1b. Putting the reg cache in the kernel means that with the MMU
> notifier system introduced in 2.6.27, the kernel can call back
> to the device driver when the mapping changes so that a) the
> memory can be deregistered, and b) the corresponding item can
> be removed from the registration cache. Specifically: the race
> condition described above can be fixed because it's all located
> in one place in the kernel.
>
> 2. This means that the userspace process must *always* call
> ibv_reg_mr() and ibv_dereg_mr() to increment / decrement the
> reference counts on the kernel reg cache. But in practice,
> on-demand registration/de-registration is only done for long
> messages (short messages typically use
> copy-to-pre-registered-buffers schemes). So the additional
> ibv_reg_mr() before calling ibv_post_send() / ibv_post_recv() for
> long messages shouldn't matter.
>
> 3. The registration cache in the kernel can lazily deregister cached
> memory, as described in the "short version" discussion, above
> (quite similar to what MPIs do today).
>
> To offset the cost of large memory registrations (registration cost
> grows linearly with the size of the buffer being registered),
> pipelined protocols are sometimes used. As such, it seems useful to
> have a "is this memory already registered?" verb -- a ULP can check to
> see if an entire long message is already registered, and if so, do a
> single large RDMA action. If not, the ULP can use a pipelined
> protocol to loop over registering a portion of the buffer and then
> RDMA'ing it.
>
> Possible pipelined pseudocode can look like this:
>
> if (ibv_is_reg(pd, buffer, len)) {
>     // whole buffer already registered: do one large RDMA
>     ibv_post_send(...);
>     // will still need to ibv_dereg_mr() after completion
> } else {
>     // pipeline loop: register and send one chunk at a time
>     // (remainder handling omitted)
>     for (i = 0; i < len / pipeline_size; ++i) {
>         ibv_reg_mr(pd, buffer + i * pipeline_size,
>                    pipeline_size, IBV_ACCESS_CACHE);
>         ibv_post_send(...);
>     }
> }
>
> The rationale here is that these verbs allow the flexibility of doing
> something like the above scenario or just registering the whole long
> buffer and sending it immediately:
>
> ibv_reg_mr(pd, buffer, len, IBV_ACCESS_CACHE);
> ibv_post_send(...);
>
> It may also be useful to programmatically enforce some limits on a given
> PD's registration cache. A per-process limit is already enforced via
> /etc/security/limits.conf, but it may be useful to specify per-PD
> limits in the ULP (MPI) itself. Note that most MPIs have controls
> like this already; it's consistent with moving the registration cache
> down to the kernel. A proposal for the verb could be:
>
> ibv_reg_mr_cache_limits(pd, max_num_pages)
>
> Another userspace-accessible verb that may be useful is one that
> traverses a PD's reg cache and actually deregisters any item with a
> refcount <= 0. This allows a ULP to "clean out" any lingering
> registrations, thereby freeing up registered memory for other uses
> (e.g., being registered by another PD). This verb can have a
> simplistic interface:
>
> ibv_reg_mr_clean(pd)
>
> It's not 100% clear that we need this "clean" verb -- if ibv_reg_mr()
> will evict entries with <= 0 refcounts from any PD's registration
> cache in this process, that might be enough. However, mixing
> verbs-registered memory with other (non-verbs) pinned memory in the
> same process may make this verb necessary.
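>
> A hypothetical usage sketch (other_library_pin() stands in for any
> non-verbs consumer of pinned memory):
>
> ibv_dereg_mr(mr);        /* refcount drops to 0; memory stays
>                             registered (lazy deregistration) */
> ibv_reg_mr_clean(pd);    /* now actually unregistered */
> other_library_pin(buf);  /* pinning by the non-verbs user succeeds */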
>
> -----
>
> Finally, it should be noted that with 2.6.27's MMU notifier system,
> full on-demand paging / registering seems possible. On-demand paging
> would be a full, complete solution -- the ULP wouldn't have to worry
> about registering / de-registering memory at all (the existing
> de/registration verbs could become no-ops for backwards
> compatibility). I assume that a proposal along these lines would
> be a [much] larger debate in the OpenFabrics community, and further
> assume that the proposal above would be a smaller debate and actually
> have a chance of being implemented in the not-distant future.
>
> (/me puts on fire suit)
>
> Thoughts?
>
> --
> Jeff Squyres
> Cisco Systems
>
--
Jeff Squyres
Cisco Systems