[ofa-general] Re: New proposal for memory management

Ralph Campbell ralph.campbell at qlogic.com
Wed Apr 29 14:04:18 PDT 2009


On Tue, 2009-04-28 at 18:10 -0700, Jeff Squyres wrote:
> On Apr 28, 2009, at 6:11 PM, Ralph Campbell wrote:
> 
> > Ah, free() just puts the buffer on a free list and a subsequent
> > malloc() can return it. The application isn't aware of the MPI
> > library calling ibv_reg_mr()
> >
> 
> Right.
> 
> > and the MPI library isn't aware of the application
> > reusing the buffer differently.
> > The virtual to physical mapping can't change while it is pinned
> > so buffer B should have been written with new data overwriting
> > the same physical pages that buffer A used.
> > I would assume the application would wait for the MPI_isend() to
> > complete before freeing the buffer so it shouldn't be the case that
> > the same buffer is in the process of being sent when the application
> > overwrites the address and tries to send it again.
> >
> 
> This is not the problem.
> 
> An MPI program that re-uses a buffer that is in use in an ongoing
> non-blocking send operation is clearly erroneous.
> 
> Perhaps my explanations were incorrect and you kernel gurus can  
> educate me.  What I know can happen is:
> 
> - MPI application alloc's buffer A and gets virtual address B back,  
> corresponding to physical address C
> - MPI application calls MPI_SEND with A
> 
> - MPI implementation registers buffer A, and caches that address B is  
> registered, and then does the send
> 
> - MPI application frees buffer A
> 
> - MPI implementation does *NOT* unregister buffer A
> 
> - MPI application alloc's buffer X and gets virtual address *B* back,  
> corresponding to physical address Z (Z!=C)
> - MPI application calls MPI_SEND with X
> 
> - MPI implementation sees virtual address B in its cache and thinks  
> that it is already registered... badness ensues
> 
> Note that the virtual addresses are the same, but the physical  
> addresses are different.  This can, and does, happen.  It makes it  
> impossible to tell the buffer apart in userspace -- MPI cannot tell  
> that the buffer is not already pinned (because according to MPI's  
> internal cache, it *is* registered already).  The only way to hack
> around this is for the MPI implementation to intercept
> free/sbrk/whatever (horrors!) so that it can a) know to unregister
> the buffer and b) remove the address from its "already registered"
> cache.
> 
> It's quite possible that I don't know why this happens, or stated the  
> wrong reasons why.  But it definitely does happen.

The problem is that MPI needs to be aware of the application doing
the free() so that it can unregister the MR or flush its MR cache
entry for that virtual address range. Of course, it would be difficult
for Open MPI to have callbacks or hooks into every way an application
might allocate or free memory.
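
To make the sequence concrete, here is a minimal sketch of the stale
cache hit and of what a free() hook would change. It is a toy Python
model, not verbs code: ToyMemory, MRCache, send_is_correct and
run_scenario are hypothetical names, register() and invalidate() only
stand in for ibv_reg_mr() and ibv_dereg_mr(), and it assumes the freed
memory goes back to the OS so that the reused virtual address gets new
physical pages.

```python
class ToyMemory:
    """Toy allocator plus page table; hypothetical, not a real OS or verbs API."""

    def __init__(self):
        self.page_table = {}       # virtual address -> physical address
        self._next_phys = 0x1000   # next unused physical page

    def alloc(self, vaddr):
        # Assumption: every allocation gets fresh physical backing, even when
        # the allocator hands back a virtual address it has used before.
        self.page_table[vaddr] = self._next_phys
        self._next_phys += 0x1000
        return vaddr

    def free(self, vaddr):
        # Models memory going back to the OS, so the mapping disappears.
        del self.page_table[vaddr]


class MRCache:
    """Registration cache keyed only by virtual address."""

    def __init__(self, memory):
        self.memory = memory
        self.pinned = {}  # virtual address -> physical address pinned at registration

    def register(self, vaddr):
        # Stands in for ibv_reg_mr(): pin whatever backs vaddr right now,
        # unless the cache claims the address is already registered.
        if vaddr not in self.pinned:
            self.pinned[vaddr] = self.memory.page_table[vaddr]
        return self.pinned[vaddr]

    def invalidate(self, vaddr):
        # Stands in for ibv_dereg_mr() plus dropping the cache entry.
        self.pinned.pop(vaddr, None)


def send_is_correct(cache, vaddr):
    """True if the adapter would read the pages the application sees now."""
    return cache.register(vaddr) == cache.memory.page_table[vaddr]


def run_scenario():
    mem = ToyMemory()
    cache = MRCache(mem)
    addr_b = 0x7F0000000000

    mem.alloc(addr_b)                          # buffer A at virtual address B
    first = send_is_correct(cache, addr_b)     # registers and caches B

    mem.free(addr_b)                           # MPI is not told about this
    mem.alloc(addr_b)                          # buffer X, same B, new pages
    stale = send_is_correct(cache, addr_b)     # cache hit on stale pinning

    mem.free(addr_b)
    cache.invalidate(addr_b)                   # what a free() hook would do
    mem.alloc(addr_b)
    fixed = send_is_correct(cache, addr_b)     # re-registered, pages match
    return first, stale, fixed
```

In this model the second send goes wrong only because the cache is
keyed by virtual address alone; once the entry is dropped at free()
time, the third send registers the new pages and matches again.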

It seems to me that this is mostly an issue for rendezvous sends.
Eager sends can use a pool of preregistered buffers, which are
reused as data is copied from the buffer and ibv_post_recv()'ed.
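
A rough sketch of why the eager path sidesteps the problem, again as a
toy Python model with hypothetical names (EagerPool, eager_send). It
covers the send side only: the pool is "registered" once up front, the
payload is copied into a pool buffer, and the application's own buffer
is never registered. Reposting receive buffers with ibv_post_recv() is
not modeled.

```python
class EagerPool:
    """Fixed pool of bounce buffers, registered once at startup (toy model)."""

    def __init__(self, count, size):
        self.size = size
        self.buffers = [bytearray(size) for _ in range(count)]
        self.free_list = list(range(count))
        # Counts stand-in registrations; it never grows after startup.
        self.registrations = count

    def acquire(self):
        # Raises IndexError if every buffer is in flight.
        return self.free_list.pop()

    def release(self, index):
        self.free_list.append(index)


def eager_send(pool, payload):
    """Copy payload into a pooled buffer and return the bytes put on the wire."""
    if len(payload) > pool.size:
        # Larger messages would take the rendezvous path instead.
        raise ValueError("payload too large for eager send")
    index = pool.acquire()
    buf = pool.buffers[index]
    buf[:len(payload)] = payload            # copy out of the user buffer
    wire = bytes(buf[:len(payload)])        # stands in for posting the send
    pool.release(index)                     # buffer is reusable afterwards
    return wire
```

Because the user buffer is only read by the copy, freeing it and
getting the same virtual address back later has no effect on any
registration.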

At least now, I think I understand your issue.



