[ofa-general] Memory registration redux

Jason Gunthorpe jgunthorpe at obsidianresearch.com
Wed May 6 15:26:38 PDT 2009


On Wed, May 06, 2009 at 02:56:25PM -0700, Roland Dreier wrote:
>  > Yuk, doesn't this problem pretty much doom this method entirely? You
>  > can't tear down the entire registration of 0x1000 ... 0x3fff if the app
>  > does something to change 0x2000 .. 0x2fff because it may have active
>  > RDMAs going on in 0x1000 ... 0x1fff.
> 
> Yes, I guess if we try to reuse registrations like this then we run into
> trouble.  I think your example points to a problem if an app registers
> 0x1000...0x3fff and then we reuse that registration for 0x2000...0x2fff
> and also for 0x1000...0x1fff, and then the app unregisters 0x1000...0x3fff.
> 
> But we can get around this just by not ever reusing registrations that
> way -- only treat something as a cache hit if it matches the start and
> length exactly.

I can't comment on that, but it feels to me like a reasonable MPI use
model would be to do small IOs randomly from the same allocation, and
to pre-hint to the library that it wants that whole area cached in one
shot.
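Roland's exact-match rule above could be sketched roughly as below. This
is only an illustration, not code from any real MPI or verbs library;
the structure and function names (reg_cache_entry, cache_lookup, lkey
field) are all made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical registration-cache entry; the names are illustrative,
 * not taken from libibverbs or any MPI implementation. */
struct reg_cache_entry {
    uintptr_t start;   /* virtual address the app registered */
    size_t    length;  /* length it registered */
    uint32_t  lkey;    /* handle from the (stubbed-out) register call */
    int       in_use;
};

#define CACHE_SLOTS 64
static struct reg_cache_entry cache[CACHE_SLOTS];

/* Exact-match lookup: a cached registration is reused only if both
 * start and length match, so tearing it down later can never strand a
 * sub-range that some other request was piggy-backing on. */
static struct reg_cache_entry *cache_lookup(uintptr_t start, size_t length)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].in_use &&
            cache[i].start == start &&
            cache[i].length == length)
            return &cache[i];
    return NULL;   /* overlapping or containing regions are NOT hits */
}

static struct reg_cache_entry *cache_insert(uintptr_t start, size_t length,
                                            uint32_t lkey)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (!cache[i].in_use) {
            cache[i] = (struct reg_cache_entry){ start, length, lkey, 1 };
            return &cache[i];
        }
    return NULL;   /* a real cache would evict here */
}
```

With this rule, a cached 0x1000...0x3fff registration is never handed
out for a 0x2000...0x2fff request, which is exactly what avoids the
teardown problem described above.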

>  > What about a slightly different twist.. Instead of trying to make
>  > everything synchronous in the mmu_notifier, just have a counter mapped
>  > to user space. Increment the counter whenever the mms change from the
>  > notifier. Pin the user page that contains the single counter upon
>  > starting the process so access is lockless.
>  > 
>  > In user space, check the counter before every cache lookup and if it
>  > has changed call back into the kernel to resynchronize the MR tables in
>  > the HCA to the current VM.
>  > 
>  > Avoids the locking and racing problems, keeps the fast path in the
>  > user space and avoids the above question about how to deal with
>  > arbitrary actions?
> 
> I like the simplicity of the fast path.  But it seems the slow path
> would be hard -- how exactly did you envision resynchronizing the MR
> tables?  (Considering that RDMAs might be in flight for MRs that weren't
> changed by the MM operations)

Well, this conceptually doesn't seem hard. Go through all the pages in
the MR; if any have changed, pin the new page and replace the page's
physical address in the HCA's page table. Once done, synchronize
with the hardware, then run through again and un-pin and release all
the replaced pages.
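The two-pass walk above could be sketched like this. It is a toy
user-space model, not kernel code: hca_table stands in for the HCA's
page table, vm_page for what the process VM currently maps at each page
of the MR, and pin_page/unpin_page/hca_synchronize are invented stubs.

```c
#include <stddef.h>
#include <stdint.h>

#define MR_PAGES 4

static uintptr_t hca_table[MR_PAGES];  /* what the HCA currently maps */
static uintptr_t vm_page[MR_PAGES];    /* what the VM currently holds */
static int pins, unpins;

static uintptr_t pin_page(int i)         { pins++; return vm_page[i]; }
static void      unpin_page(uintptr_t p) { (void)p; unpins++; }
static void      hca_synchronize(void)   { /* fence with the hardware */ }

static void mr_resync(void)
{
    uintptr_t replaced[MR_PAGES];
    int nreplaced = 0;

    /* Pass 1: pin any page that changed and swap it into the HCA's
     * page table, remembering the displaced physical address. */
    for (int i = 0; i < MR_PAGES; i++) {
        if (hca_table[i] != vm_page[i]) {
            replaced[nreplaced++] = hca_table[i];
            hca_table[i] = pin_page(i);
        }
    }

    /* Make sure the hardware sees the updated table before ... */
    hca_synchronize();

    /* Pass 2: ... releasing the pages that were displaced. */
    for (int i = 0; i < nreplaced; i++)
        unpin_page(replaced[i]);
}
```

The point of the ordering is that no page is un-pinned until the
hardware is guaranteed to be using the replacement, so unchanged pages
with in-flight RDMA are never disturbed.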

Every HCA must already have the necessary primitives for this, since
they are needed to support register and unregister...

An RDMA that is in progress to any page that is replaced is a
'use after free' type programming error. (And this means certain wacky
uses, like using MAP_FIXED on memory that is under active RDMA,
would be unsupported without an additional call)

Doing this on a page-by-page basis rather than a
registration-by-registration basis is granular enough to avoid the
problem you noticed.

The mmu notifiers can optionally make note of the affected pages
during the callback to reduce the workload of the syscall.
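The user-space half of the counter scheme is a single load and compare.
A rough sketch, with the shared pinned page and the resync syscall both
simulated by stand-ins (vm_generation, resync_mrs are invented names):

```c
#include <stdint.h>

/* In the real scheme vm_generation would live in a pinned page shared
 * with the kernel and be incremented from the mmu notifier; resync_mrs
 * would be the syscall that walks the MR tables. Both are simulated. */
static volatile uint64_t vm_generation;  /* written by the "kernel" */
static uint64_t seen_generation;         /* library's cached copy */
static int resync_calls;

static void resync_mrs(void) { resync_calls++; }  /* syscall stand-in */

/* Called before every registration-cache lookup: lockless in the
 * common case, dropping into the slow path only when the VM changed. */
static void check_vm_generation(void)
{
    uint64_t g = vm_generation;
    if (g != seen_generation) {
        resync_mrs();
        seen_generation = g;
    }
}
```

Note the counter is read once into a local before the compare, so a
concurrent increment simply means one extra resync on the next lookup
rather than a missed update.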

The only part I don't immediately see is how to trap creation of new VM
(i.e. mmap()); mmu notifiers seem focused on invalidation, i.e. munmap()..

Jason


