[ofa-general] Memory registration redux

Jason Gunthorpe jgunthorpe at obsidianresearch.com
Tue May 26 16:51:58 PDT 2009


On Tue, May 26, 2009 at 04:13:08PM -0700, Roland Dreier wrote:

>  > >  > Or, ignore the overlapping problem, and use your original technique,
>  > >  > slightly modified:
>  > >  >  - Userspace registers a counter with the kernel. Kernel pins the
>  > >  >    page, sets up mmu notifiers and increments the counter when
>  > >  >    invalidates intersect with registrations
>  > >  >  - Kernel maintains a linked list of registrations that have been
>  > >  >    invalidated via mmu notifiers using the registration structure
>  > >  >    and a dirty bit
>  > >  >  - Userspace checks the counter at every cache hit, if different it
>  > >  >    calls into the kernel:
>  > >  >        MR_Cookie *mrs[100];
>  > >  >        int rc = ibv_get_invalid_mrs(mrs,100);
>  > >  >        invalidate_cache(mrs,rc);
>  > >  >        // Repeat until drained
>  > >  > 
>  > >  >    get_invalid_mrs traverses the linked list and returns an
>  > >  >    identifying value to userspace, which looks it up in the cache,
>  > >  >    calls unregister and removes it from the cache.
>  > > 
>  > > What's the advantage of this?  I have to do the get_invalid_mrs() call a
>  > > bunch of times, rather than just reading which ones are invalid from the
>  > > cache directly?
>  > 
>  > This is a trade off, the above is a more normal kernel API and lets
>  > the app get a list of changes it can scan. Having the kernel update
>  > flags means if the app wants a list of changes it has to scan all
>  > registrations.
> 
> The more I thought about this, the more I liked the idea, until I liked
> it so much that I actually went ahead and prototyped this.  A
> preliminary version is below -- *very* lightly tested, and no doubt
> there are obvious bugs that any real use or review will uncover.  But I
> thought I'd throw it out and hope for comments and/or testing.  I'm
> actually pretty happy with how small and simple this ended up being.

Seems reasonable to me. This doesn't catch all mmap cases, for example
this kind of sequence:

 t = mmap(NULL, 3 * page_size, PROT_READ,
          MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
 if (umn_register(t, 3 * page_size, 123))
         return 1;

 t = mmap(t, page_size, PROT_READ,
          MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
 // Event? Probably

 munmap(t, page_size);
 // Event? No, no MAP_POPULATE

 t = mmap(t, page_size, PROT_READ,
          MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
 // Event? No

And I guess the use of MAP_POPULATE is deliberate, as that's how mmu
notifiers work.

So the use model for an MPI would be to call ibv_register/umn_register
and watch for events. Any event at all means the entire region is
toast and must be re-registered the next time someone calls with that
address. ibv_register does the same as MAP_POPULATE internally.

The MPI library uses the result of this to build a list of invalidated
regions. From time to time the MPI library should unregister those
regions.
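
To make that use model concrete, here is a rough sketch of the userspace
side. It is only a toy cache: every name in it (mr_entry, cache_add,
cache_on_event, cache_lookup, cache_reap) is made up for illustration and
is not from Roland's patch or from libibverbs. It assumes events arrive
carrying the cookie that was passed at registration time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace registration cache, for illustration only. */
enum mr_state { MR_FREE, MR_VALID, MR_INVALID };

struct mr_entry {
    uintptr_t start;
    size_t len;
    uint64_t cookie;      /* value given at registration, echoed in events */
    enum mr_state state;
};

#define CACHE_SLOTS 8
static struct mr_entry cache[CACHE_SLOTS];

/* Record a new registration. Returns 0 on success, -1 if the cache is full. */
static int cache_add(uintptr_t start, size_t len, uint64_t cookie)
{
    assert(len > 0);
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].state == MR_FREE) {
            cache[i] = (struct mr_entry){ start, len, cookie, MR_VALID };
            return 0;
        }
    }
    return -1;
}

/* Any event at all for a cookie marks the entire region as invalid. */
static void cache_on_event(uint64_t cookie)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].state == MR_VALID && cache[i].cookie == cookie)
            cache[i].state = MR_INVALID;
}

/* A hit requires a still-valid entry covering [start, start + len). */
static struct mr_entry *cache_lookup(uintptr_t start, size_t len)
{
    for (int i = 0; i < CACHE_SLOTS; i++) {
        struct mr_entry *e = &cache[i];
        if (e->state == MR_VALID && start >= e->start &&
            start + len <= e->start + e->len)
            return e;
    }
    return NULL;
}

/* Run from time to time: drop invalidated entries and return how many
 * were dropped. A real cache would call ibv_dereg_mr() on each one. */
static int cache_reap(void)
{
    int n = 0;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].state == MR_INVALID) {
            cache[i].state = MR_FREE;
            n++;
        }
    }
    return n;
}
```

A lookup that misses because the entry was invalidated just falls through
to a fresh registration, which is what "re-registered the next time
someone calls with that address" amounts to.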

If that is the use then the kernel side should probably also be a
one-shot type of interface..
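
By one-shot I mean roughly the following behaviour, shown here as a toy
model in plain C rather than kernel code. The struct and function names
(watch, watch_arm, watch_invalidate) are hypothetical and only illustrate
the semantics: the first invalidate that intersects the range produces an
event, and nothing more is reported until userspace registers again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of a one-shot watch; not the prototype's actual code. */
struct watch {
    uintptr_t start;
    size_t len;
    uint64_t cookie;
    bool armed;
};

/* (Re-)register interest in [start, start + len). */
static void watch_arm(struct watch *w, uintptr_t start, size_t len,
                      uint64_t cookie)
{
    assert(len > 0);
    w->start = start;
    w->len = len;
    w->cookie = cookie;
    w->armed = true;
}

/* Model an invalidate of [start, start + len).
 * Returns true if an event would be delivered to userspace. */
static bool watch_invalidate(struct watch *w, uintptr_t start, size_t len)
{
    bool overlap = start < w->start + w->len && w->start < start + len;

    if (!w->armed || !overlap)
        return false;
    w->armed = false;   /* one-shot: stay quiet until re-armed */
    return true;
}
```

Since any event already means the whole region must be re-registered,
repeat events for the same region would carry no extra information.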

I'm also trying to think of a use case outside of RDMA and failing - if
the kernel hasn't pinned the pages being watched through some other
means it seems useless as a general feature??

Jason


