[ofa-general] Re: New proposal for memory management

Jeff Squyres jsquyres at cisco.com
Tue Apr 28 18:10:35 PDT 2009


On Apr 28, 2009, at 6:11 PM, Ralph Campbell wrote:

> Ah, free() just puts the buffer on a free list and a subsequent
> malloc() can return it. The application isn't aware of the MPI
> library calling ibv_reg_mr()
>

Right.

> and the MPI library isn't aware of the application
> reusing the buffer differently.
> The virtual to physical mapping can't change while it is pinned
> so buffer B should have been written with new data overwriting
> the same physical pages that buffer A used.
> I would assume the application would wait for the MPI_isend() to
> complete before freeing the buffer so it shouldn't be the case that
> the same buffer is in the process of being sent when the application
> overwrites the address and tries to send it again.
>

This is not the problem.

An MPI program that re-uses a buffer that is in use in an ongoing
non-blocking send operation is clearly erroneous.

Perhaps my explanations were incorrect and you kernel gurus can  
educate me.  What I know can happen is:

1. MPI application allocs buffer A and gets virtual address B back,
   corresponding to physical address C

2. MPI application calls MPI_SEND with A

3. MPI implementation registers buffer A, caches that address B is
   registered, and then does the send

4. MPI application frees buffer A

5. MPI implementation does *NOT* unregister buffer A

6. MPI application allocs buffer X and gets virtual address *B* back,
   corresponding to physical address Z (Z != C)

7. MPI application calls MPI_SEND with X

8. MPI implementation sees virtual address B in its cache and thinks
   that it is already registered... badness ensues
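
To make this concrete, here is a minimal userspace sketch of steps 1-8
(the MPI-side behavior lives only in the comments; the printf merely
shows that allocators really do hand the same virtual address back):

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    int main(void)
    {
        char *a = malloc(65536);      /* buffer A: virtual address B,
                                         physical pages C */
        uintptr_t addr_of_a = (uintptr_t)a;

        /* MPI_SEND(A): the MPI library calls ibv_reg_mr() on A and
           caches "virtual address B is registered". */

        free(a);                      /* the MPI library never sees this */

        char *x = malloc(65536);      /* buffer X: frequently the SAME
                                         virtual address B, possibly
                                         different physical pages Z */
        if ((uintptr_t)x == addr_of_a)
            printf("allocator returned the same virtual address\n");

        /* MPI_SEND(X): the cache hit on B means ibv_reg_mr() is
           skipped, but the still-pinned MR may map the OLD physical
           pages C -- the HCA DMAs the wrong memory. */

        free(x);
        return 0;
    }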

Note that the virtual addresses are the same, but the physical
addresses are different.  This can, and does, happen.  It makes it
impossible to tell the two buffers apart in userspace -- MPI cannot
tell that the buffer is not already pinned (because according to MPI's
internal cache, it *is* registered already).  The only way to hack
around this is for the MPI implementation to intercept
free/sbrk/whatever (horrors!) so that it can a) know to unregister the
buffer and b) remove the address from its "already registered" cache.
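
For completeness, the interception hack usually looks something like
the following (a sketch via LD_PRELOAD symbol interposition; compile as
a shared library and link with -ldl; reg_cache_evict() is a stand-in
for hypothetical MPI-internal bookkeeping, not a real API):

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdlib.h>

    /* Hypothetical MPI-internal helper: would look up ptr in the
       registration cache, ibv_dereg_mr() it, and purge the entry. */
    static void reg_cache_evict(void *ptr)
    {
        (void)ptr;
    }

    void free(void *ptr)
    {
        static void (*real_free)(void *);

        if (!real_free)
            real_free = (void (*)(void *))dlsym(RTLD_NEXT, "free");
        if (ptr)
            reg_cache_evict(ptr);   /* evict before the pages can be
                                       recycled by a later malloc() */
        real_free(ptr);
    }

And free() is only part of it: sbrk() and munmap() need the same
treatment, which is why "horrors!" is the right word.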

It's quite possible that I don't understand exactly why this happens,
or have stated the wrong reasons for it.  But it definitely does happen.

> > > Note that the above scenario occurs because before Linux kernel
> > > v2.6.27, the OF kernel drivers are not notified when pages are
> > > returned to the OS -- we're leaking registered memory, and therefore
> > > the OF driver/hardware have the wrong virtual/physical mapping.  It
> > > *may* not segv at step 7 because the OF driver/hardware can still
> > > access the memory and it is still registered.  But it will definitely
> > > be accessing the wrong physical memory.
>
> Well, the driver can register for callbacks when the mapping changes
> but most HCA drivers aren't going to be able to use it.
> The problem is that once a memory region is created, there is no way
> the driver knows when an incoming or outgoing DMA might try to
> reference that address.
>

Wouldn't it be an erroneous program that tried to use a region after  
free()'ing it?
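
(For reference: the callbacks Ralph mentions are the mmu_notifier hooks
that went into 2.6.27.  A minimal kernel-side sketch -- the notifier
interface is real, the driver logic is hypothetical:)

    #include <linux/mmu_notifier.h>
    #include <linux/mm.h>

    static void hca_invalidate_range_start(struct mmu_notifier *mn,
                                           struct mm_struct *mm,
                                           unsigned long start,
                                           unsigned long end)
    {
        /* The driver would have to quiesce any DMA that might touch
           [start, end), tear down or remap the affected MRs, and only
           then let the kernel change the mapping -- exactly the
           "suspend DMAs, change the mapping, continue" problem Ralph
           describes next. */
    }

    static const struct mmu_notifier_ops hca_mn_ops = {
        .invalidate_range_start = hca_invalidate_range_start,
    };

    /* Registered per-process, e.g. at ibv_reg_mr() time:

           mn->ops = &hca_mn_ops;
           mmu_notifier_register(mn, current->mm);
    */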

> There would need to be a way to suspend DMAs,
> change the mapping, and then allow DMAs to continue.
> The CPU equivalent is a TLB flush after changing the page table  
> memory.
>
> The whole area of page pinning, mapping, unmapping, etc. between the
> application, MPI library, OS, and driver is very complex and I don't
> think can be designed easily via email.
>

The conversation needs to start somewhere.  MPI is verbs' biggest
customer; this is a major pain point for all of us.  Can't we fix it?
Do you need something more than a specific use case and API proposal
to start the conversation?  No one has money to travel; the bi-weekly
EWG call is for discussing bugs.  What other vehicle do you suggest
for this discussion?

I'd consider this issue to be one of the top 3 roadblocks to verbs
adoption for developers other than those of us who write MPI
implementations.

> I wasn't at the Sonoma
> conference so I don't know what was discussed.
>

Only the problem was discussed.  It was hypothesized that Pete
Wyckoff's "tweak a bit in userspace when something changes" notifier
interface would fix the problem, but per my mail, after more
post-Sonoma discussion, we think that it's not sufficient.
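
To illustrate the insufficiency (and to be clear: this is my sketch of
the general shape of such a notifier, not Pete's actual interface), a
kernel-maintained generation counter mapped into the process can tell
the library that *something* changed, but not *what*:

    /* Hypothetical: a kernel-shared counter, bumped whenever any
       mapping in the process changes.  How it gets mapped is omitted. */
    extern volatile unsigned long *map_generation;

    static unsigned long last_seen;

    /* Returns nonzero if the cached registration for addr can still
       be trusted (sketch only). */
    int reg_cache_entry_valid(void *addr)
    {
        (void)addr;   /* a single counter can't be queried per-address */
        if (*map_generation != last_seen) {
            /* Something changed somewhere -- but we can't tell which
               buffer, so the entire cache would have to be flushed and
               everything re-registered.  That coarseness is the kind
               of gap that makes a userspace bit-tweak look
               insufficient. */
            last_seen = *map_generation;
            return 0;
        }
        return 1;
    }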

My Sonoma slides are here:

     http://www.openfabrics.org/archives/spring2009sonoma/tuesday/panel3/panel3.zip
     http://www.openfabrics.org/archives/spring2009sonoma/wednesday/panel1/panel1.zip

> The "ideal" from
> MPI library perspective is to not have to worry about memory
> registrations and have the HCA somehow share the user application's
> page table, faulting in IB to physical address mappings as needed.
> That involves quite a bit of hardware support as well as the changes
> in 2.6.27.
>


Understood -- but as I stated in my mail, I assume that such a change
is a long way off (particularly since it needs some kind of hardware
support).  Moving the registration cache down into the kernel seems
doable.  Why not try to tackle this [enormous] problem?
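
For instance -- and this is a purely hypothetical sketch of one
possible shape, not an existing verb or the final proposal -- the
kernel, which actually sees the page tables, could own the validity
check:

    #include <infiniband/verbs.h>

    /* Hypothetical verb -- does not exist today.  If the kernel
       already holds a registration covering (addr, length) AND the
       current virtual->physical mapping still matches it, return the
       cached MR; otherwise register fresh.  Userspace never has to
       guess about mappings it cannot see. */
    struct ibv_mr *ibv_reg_mr_cached(struct ibv_pd *pd, void *addr,
                                     size_t length, int access);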

-- 
Jeff Squyres
Cisco Systems



