[ofa-general] Memory registration redux

Tue May 5 13:57:09 PDT 2009

Roland and I chatted on the phone today; I think I now understand  
Roland's counter-proposal (I clearly didn't before).  Let me try to  
summarize:

1. Add a new verb for "set this userspace flag to 1 if mr X ever  
becomes invalid"
2. Add a new verb for "no longer tell me if mr X ever becomes  
invalid" (i.e., remove the effects of #1)
3. Add a run-time query indicating whether #1 works
4. Add [optional] memory registration caching to libibverbs

Prior to talking to Roland, I had envisioned *one* flag in userspace  
that indicated whether any memory registrations had become invalid.   
Roland's idea is that there is one flag *per registration* -- you can  
instantly tell whether a specific registration is valid.

Given this, let's keep the discussion going here in email -- perhaps
the teleconference next Monday will become moot.

---------------------------------------------

More detail...

Here's a sample scenario:

- userspace registers memory buffer A
- userspace adds this registration to its cache
   (note: the cache could be in libibverbs; more on this below)
- userspace calls a [new] verb that says "tell me if mr X ever becomes  
invalid" and passes a pointer to a flag *in this registration's entry  
in the cache*
- userspace leaves the memory buffer A registered/cached

Some scenarios after the above has run:

1. Userspace uses buffer A again
    - userspace looks up and finds A's cached registration
    - userspace sees that this registration's flag is still 0, and  
therefore can proceed with communication

2. Application frees buffer A and it is returned to the OS (e.g., munmap)
    - IOMMU fires
    - kernel sets the userspace flag corresponding to this registration to 1
    - memory is unregistered
    - pages are returned

3. Userspace uses buffer A again (after #2)
    - userspace looks up and finds A's cached registration
    - userspace sees that this registration's flag is 1
    - userspace therefore registers this memory again, and re-calls  
the verb saying "tell me if mr X ever becomes invalid" (etc.)
    - userspace proceeds with communication

The kernel has to store a little extra state for each registration  
(the address of the userspace flag to tweak if the registration ever  
becomes invalid), but it's small and bounded by the number of active  
registrations.
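
To make the three scenarios concrete, here is a minimal sketch in C. It
does not call libibverbs at all: mock_register() stands in for
ibv_reg_mr() plus the proposed "tell me if mr X ever becomes invalid"
verb, and mock_kernel_invalidate() stands in for whatever the kernel
would do in scenario #2. All names here (reg_entry, mock_register,
mock_kernel_invalidate, get_registration) are made up for illustration;
none of them is a proposed API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cache entry: one invalidation flag per registration. */
struct reg_entry {
    void *addr;
    size_t len;
    volatile int invalid;  /* assumed: the kernel sets this to 1 */
    int registrations;     /* how many times we (re)registered */
};

/* Stand-in for ibv_reg_mr() followed by the proposed notify verb,
 * which would be handed the address of e->invalid. */
static void mock_register(struct reg_entry *e, void *addr, size_t len)
{
    e->addr = addr;
    e->len = len;
    e->invalid = 0;
    e->registrations++;
}

/* Stand-in for the kernel side of scenario #2: the buffer went away,
 * so the flag for this registration is flipped to 1. */
static void mock_kernel_invalidate(struct reg_entry *e)
{
    e->invalid = 1;
}

/* Scenarios #1 and #3: reuse the cached registration if its flag is
 * still 0, otherwise register again.  Returns 1 if a (mock)
 * registration was performed, 0 if the cached one was reused. */
static int get_registration(struct reg_entry *e, void *addr, size_t len)
{
    assert(e != NULL);
    if (e->registrations > 0 && e->addr == addr &&
        e->len >= len && !e->invalid)
        return 0;               /* scenario #1: flag is 0, proceed */
    mock_register(e, addr, len); /* scenario #3: flag is 1, re-register */
    return 1;
}
```

The point of the sketch is that the common case (scenario #1) is a
plain memory read of the flag in the cache entry, with no syscall and
no memory allocator hooks.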

From MPI's perspective, this feature would be a great step forward --
if we can query verbs at run-time to see if this feature is active, we
can stop using the memory allocation hooks (yay!).  Obviously, MPIs
will need to carry the old memory allocation hooks for backwards
compatibility for a while, but if we can effectively deprecate them,
that would be great.

**Specifically: it's the memory allocation hooks code in MPI  
implementations that is "fragile", "brittle", etc.  Avoiding the issue  
would be great; the code becomes much more robust because we're not  
subverting the memory allocator.

A secondary feature would be to add memory registration caching to  
libibverbs.  This wouldn't be *required* for MPIs since we all have  
registration caches already, but it might be nice to deprecate/ 
eventually remove that code in an MPI implementation, too.

The use case is similar to what was proposed earlier: add a flag to  
ibv_reg_mr() indicating whether you want the registration cached or  
not.  If the registration is to be cached, libibverbs would also  
invoke the "tell me if this mr ever becomes invalid" functionality.
The MPI/application then *always* calls ibv_reg_mr() to register  
memory -- if the cache in libibverbs finds a valid matching mr, it can  
just return without a syscall.  As also described previously, calls to  
ibv_dereg_mr() do not necessarily need to actually unregister -- they
can just mark a cached registration as "able to be evicted if necessary."
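
As a rough illustration of that caching behavior, here is a small C
sketch.  It is a mock, not libibverbs code: cached_reg_mr() and
cached_dereg_mr() are hypothetical stand-ins for ibv_reg_mr() called
with the "cache this" flag and for ibv_dereg_mr(), the fixed-size table
is an arbitrary simplification, and the "syscalls" counter just counts
how often a real kernel registration would have been needed.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SLOTS 8

/* Hypothetical cached registration inside the library. */
struct cached_mr {
    void *addr;
    size_t len;
    int in_use;            /* slot holds a registration */
    int evictable;         /* dereg was called; may be reused */
    volatile int invalid;  /* assumed: set to 1 by the kernel */
};

static struct cached_mr cache[CACHE_SLOTS];
static int syscalls;       /* simulated kernel registrations */

/* Stand-in for ibv_reg_mr() with caching requested. */
static struct cached_mr *cached_reg_mr(void *addr, size_t len)
{
    struct cached_mr *free_slot = NULL, *victim = NULL;

    assert(addr != NULL);
    for (int i = 0; i < CACHE_SLOTS; i++) {
        struct cached_mr *m = &cache[i];
        if (m->in_use && m->invalid)
            m->in_use = 0;          /* kernel already dropped it */
        if (!m->in_use) {
            if (!free_slot)
                free_slot = m;
            continue;
        }
        if (m->addr == addr && m->len >= len) {
            m->evictable = 0;       /* cache hit: no syscall */
            return m;
        }
        if (m->evictable && !victim)
            victim = m;
    }

    struct cached_mr *m = free_slot ? free_slot : victim;
    if (!m)
        return NULL;                /* cache full, nothing evictable */
    m->addr = addr;
    m->len = len;
    m->in_use = 1;
    m->evictable = 0;
    m->invalid = 0;
    syscalls++;                     /* a real registration happens here */
    return m;
}

/* Stand-in for ibv_dereg_mr(): keep the registration, but allow it
 * to be evicted if the cache needs the slot. */
static void cached_dereg_mr(struct cached_mr *m)
{
    m->evictable = 1;
}
```

Registering, deregistering, and registering the same buffer again costs
one simulated syscall in this sketch; a second one is only needed after
the invalid flag has been set.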

The other new verbs discussed in my prior mail would also still be  
useful (ibv_is_reg(), ibv_reg_mr_limits(), ibv_reg_mr_clean()).

**Note: the registration caches in MPIs today are not necessarily
that complicated.  They're essentially balanced trees (e.g., in OMPI,  
it's a red-black tree).  This is not the "fragile", "brittle" code --  
it's just data structures and accounting.
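
To show how little is involved, here is a sketch of the core lookup:
find the cached registration that covers a given address range.  For
brevity it uses a sorted array with binary search in place of a
red-black tree; it is not taken from OMPI or any other MPI, and the
names (range, find_covering) are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One cached registration: [base, base + len). */
struct range {
    uintptr_t base;
    size_t len;
};

/* Entries are sorted by base and do not overlap.  Returns the index
 * of the entry that fully covers [addr, addr + len), or -1. */
static int find_covering(const struct range *r, int n,
                         uintptr_t addr, size_t len)
{
    int lo = 0, hi = n - 1, best = -1;

    assert(r != NULL || n == 0);
    /* Find the last entry whose base is <= addr. */
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (r[mid].base <= addr) {
            best = mid;
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }
    if (best < 0)
        return -1;

    /* Check that the requested range fits inside that entry. */
    uintptr_t off = addr - r[best].base;
    if (off <= r[best].len && len <= r[best].len - off)
        return best;
    return -1;
}
```

A tree gives the same lookup with cheaper inserts and removals; the
logic is otherwise the same.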

=================================

I refrained from a specific new API proposal; let's argue over these  
ideas first and see if we can come to consensus.  If so, specific API  
proposals can follow.

-- 
Jeff Squyres
Cisco Systems