[ofa-general] Re: New proposal for memory management

Ralph Campbell ralph.campbell at qlogic.com
Tue Apr 28 15:11:08 PDT 2009


On Tue, 2009-04-28 at 14:31 -0700, Jeff Squyres wrote:
> Is anyone going to comment on this?  I'm surprised / disappointed that
> it's been over 2 weeks with *no* comments.
> 
> Roland can't lead *every* discussion...
> 
> 
> On Apr 13, 2009, at 12:07 PM, Jeff Squyres wrote:
> 
> > The following is a proposal from several MPI implementations to the
> > OpenFabrics community (various MPI implementation representatives
> > CC'ed).  The basic concept was introduced in the MPI Panel at Sonoma
> > (see http://www.openfabrics.org/archives/spring2009sonoma/tuesday/panel3/panel3.zip)
> > ; it was further refined in discussions after Sonoma.
> >
> > Introduction:
> > =============
> >
> > MPI has long had a problem maintaining its own verbs memory
> > registration cache in userspace.  The main issue is that user
> > applications are responsible for allocating/freeing their own data
> > buffers -- the MPI layer does not (usually) have visibility when
> > application buffers are allocated or freed.  Hence, MPI has had to
> > intercept deallocation calls in order to know when its registration
> > cache entries have potentially become invalid.  Horrible and dangerous
> > tricks are used to intercept the various flavors of free, sbrk,
> > munmap, etc.
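
Just to make "horrible and dangerous tricks" concrete: the interception
usually looks something like the sketch below, assuming a
dlsym(RTLD_NEXT)-style wrapper. mpi_reg_cache_invalidate() is a made-up
name for whatever hook the registration cache exposes, not any
particular MPI's code.

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdlib.h>

/* Hypothetical hook into the MPI registration cache. */
extern void mpi_reg_cache_invalidate(void *addr);

/* Interpose on free() so the cache can drop (and ibv_dereg_mr) any
 * registration covering ptr before the allocator recycles the pages. */
void free(void *ptr)
{
    static void (*real_free)(void *);

    if (!real_free)
        real_free = (void (*)(void *)) dlsym(RTLD_NEXT, "free");

    if (ptr)
        mpi_reg_cache_invalidate(ptr);

    real_free(ptr);
}
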
> >
> > Here's the classic scenario we're trying to handle better:
> >
> > 1. MPI application allocs buffer A and MPI_SENDs it
> > 2. MPI library registers buffer A and caches it (in user space)
> > 3. MPI application frees buffer A

The memory is pinned, so the OS isn't going to actually free it.
By "alloc" and "free" I assume you mean malloc()/free() or any other
calls which might change the memory footprint of an application.
The MPI library needs to ibv_dereg_mr() the buffer before it can
actually be freed and the pages returned to the free pool.
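
To make the ordering concrete, here is a minimal sketch of what I mean,
using a one-entry cache. The cache structure and helper names are made
up for illustration; ibv_reg_mr()/ibv_dereg_mr() are the real verbs
calls that take and drop the pin.

#include <infiniband/verbs.h>
#include <stddef.h>

/* One-entry registration "cache", purely for illustration. */
struct reg_cache_entry {
    void          *addr;
    size_t         len;
    struct ibv_mr *mr;    /* while this exists, the pages stay pinned */
};

static struct reg_cache_entry cache;

/* Send path: reuse the cached registration or create a new one. */
static struct ibv_mr *cache_register(struct ibv_pd *pd, void *buf, size_t len)
{
    if (cache.mr && cache.addr == buf && cache.len >= len)
        return cache.mr;                  /* cache hit */

    cache.mr   = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    cache.addr = buf;
    cache.len  = len;
    return cache.mr;
}

/* This must happen before the pages can really go back to the kernel:
 * ibv_dereg_mr() drops the pin that ibv_reg_mr() took. */
static void cache_evict(void *buf)
{
    if (cache.mr && cache.addr == buf) {
        ibv_dereg_mr(cache.mr);
        cache.mr = NULL;
    }
}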

> > 4. page containing buffer A is returned to the OS
> > 5. MPI application allocs buffer B
> >   5a. B is at the same virtual address as A, but different physical
> > address
> > 6. MPI application MPI_SENDs buffer B
> > 7. MPI library thinks B is already registered and sends it
> >   --> the physical address may well still be registered, so the send
> >       does not fail -- but it's the wrong data

Ah, free() just puts the buffer on a free list and a subsequent malloc()
can return it. The application isn't aware of the MPI library calling
ibv_reg_mr(), and the MPI library isn't aware of the application
reusing the buffer for something else.
The virtual-to-physical mapping can't change while the pages are pinned,
so buffer B's new data should simply overwrite the same physical pages
that buffer A used.
I would assume the application waits for the MPI_Isend() to complete
before freeing the buffer, so it shouldn't be the case that the same
buffer is still being sent when the application overwrites the address
and tries to send it again.
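
The address reuse is easy to see with a toy example (no verbs
involved); on a glibc system this will very often print the same
pointer twice:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    void *a = malloc(64 * 1024);
    printf("buffer A at %p\n", a);
    free(a);                        /* typically just goes on the free list */

    void *b = malloc(64 * 1024);
    printf("buffer B at %p\n", b);  /* very often the same virtual address */

    /* An address-keyed registration cache can't tell A and B apart,
     * which is how step 7 above ends up reusing a stale entry. */
    free(b);
    return 0;
}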

> > Note that the above scenario occurs because before Linux kernel
> > v2.6.27, the OF kernel drivers are not notified when pages are
> > returned to the OS -- we're leaking registered memory, and therefore
> > the OF driver/hardware have the wrong virtual/physical mapping.  It
> > *may* not segv at step 7 because the OF driver/hardware can still
> > access the memory and it is still registered.  But it will definitely
> > be accessing the wrong physical memory.

Well, the driver can register for callbacks when the mapping changes,
but most HCA drivers aren't going to be able to use them.
The problem is that once a memory region is created, the driver has
no way of knowing when an incoming or outgoing DMA might try to
reference that address. There would need to be a way to suspend DMAs,
change the mapping, and then allow DMAs to continue.
The CPU equivalent is a TLB flush after changing the page tables.
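
The mmu notifiers added in 2.6.27 are that callback mechanism, so a
driver that could actually use them would look roughly like the sketch
below (assuming the 2.6.27-era callback signatures). The hca_* helpers
are purely hypothetical stand-ins for suspend/remap/resume hardware
support that most HCAs don't have:

#include <linux/mmu_notifier.h>

/* Hypothetical per-process driver state. */
struct hca_mm_ctx {
    struct mmu_notifier mn;
};

/* Made-up placeholders for hardware support that mostly doesn't exist:
 * quiesce DMA touching a range, drop stale translations, resume DMA. */
void hca_quiesce_dma(struct hca_mm_ctx *ctx,
                     unsigned long start, unsigned long end);
void hca_invalidate_translations(struct hca_mm_ctx *ctx,
                                 unsigned long start, unsigned long end);
void hca_resume_dma(struct hca_mm_ctx *ctx,
                    unsigned long start, unsigned long end);

/* Kernel is about to change the mapping for [start, end): the HCA
 * equivalent of a TLB shootdown has to happen before it proceeds. */
static void hca_invalidate_range_start(struct mmu_notifier *mn,
                                       struct mm_struct *mm,
                                       unsigned long start, unsigned long end)
{
    struct hca_mm_ctx *ctx = container_of(mn, struct hca_mm_ctx, mn);

    hca_quiesce_dma(ctx, start, end);
    hca_invalidate_translations(ctx, start, end);
}

/* New mapping is in place: refetch translations and let DMA continue. */
static void hca_invalidate_range_end(struct mmu_notifier *mn,
                                     struct mm_struct *mm,
                                     unsigned long start, unsigned long end)
{
    struct hca_mm_ctx *ctx = container_of(mn, struct hca_mm_ctx, mn);

    hca_resume_dma(ctx, start, end);
}

static const struct mmu_notifier_ops hca_mmu_ops = {
    .invalidate_range_start = hca_invalidate_range_start,
    .invalidate_range_end   = hca_invalidate_range_end,
};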

The whole area of page pinning, mapping, unmapping, etc. between the
application, MPI library, OS, and driver is very complex, and I don't
think it can be designed easily via email. I wasn't at the Sonoma
conference so I don't know what was discussed. The "ideal" from the
MPI library's perspective is not to have to worry about memory
registrations at all and have the HCA somehow share the user
application's page table, faulting in IB-to-physical address mappings
as needed. That involves quite a bit of hardware support as well as
the changes in 2.6.27.



