[ofa-general] Memory registration redux
Jeff Squyres
jsquyres at cisco.com
Tue May 5 13:57:09 PDT 2009
Roland and I chatted on the phone today; I think I now understand
Roland's counter-proposal (I clearly didn't before). Let me try to
summarize:
1. Add a new verb for "set this userspace flag to 1 if mr X ever
becomes invalid"
2. Add a new verb for "no longer tell me if mr X ever becomes
invalid" (i.e., remove the effects of #1)
3. Add run-time query indicating whether #1 works
4. Add [optional] memory registration caching to libibverbs
Prior to talking to Roland, I had envisioned *one* flag in userspace
that indicated whether any memory registrations had become invalid.
Roland's idea is that there is one flag *per registration* -- you can
instantly tell whether a specific registration is valid.
Given this, let's keep the discussion going here in email -- the
teleconference next Monday may well become moot.
---------------------------------------------
More detail...
Here's a sample scenario:
- userspace registers memory buffer A
- userspace adds this registration to its cache
(note: the cache could be in libibverbs; more on this below)
- userspace calls a [new] verb that says "tell me if mr X ever becomes
invalid" and passes a pointer to a flag *in this registration's entry
in the cache*
- userspace leaves the memory buffer A registered/cached
Some scenarios after the above has run:
1. Userspace uses buffer A again
- userspace looks up and finds A's cached registration
- userspace sees that this registration's flag is still 0, and
therefore can proceed with communication
2. Application frees buffer A and it is returned to the OS (e.g., munmap)
- IOMMU fires
- change userspace flag corresponding to this registration to 1
- memory is unregistered
- pages are returned
3. Userspace uses buffer A again (after #2)
- userspace looks up and finds A's cached registration
- userspace sees that this registration's flag is 1
- userspace therefore registers this memory again, and re-calls
the verb saying "tell me if mr X ever becomes invalid" (etc.)
- userspace proceeds with communication
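
To make the three scenarios above concrete, here is a rough userspace-only sketch in C. Everything in it is made up for illustration: fake_register() stands in for "register + tell me if mr X ever becomes invalid", fake_kernel_invalidate() stands in for whatever the kernel would do when the pages go away, and the integer handle stands in for a real mr. None of these are existing or proposed verbs; the only point is the per-registration flag check in the lookup path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cache entry: one invalidation flag *per registration*. */
struct reg_entry {
    void *addr;
    size_t len;
    int handle;            /* stand-in for a real mr */
    volatile int invalid;  /* the kernel would flip this to 1 */
    bool in_use;
};

#define CACHE_SLOTS 16
static struct reg_entry cache[CACHE_SLOTS];
static int next_handle = 1;
static int register_calls = 0;  /* counts simulated registrations */

/* Stand-in for "register memory" plus the new "tell me if this mr
 * ever becomes invalid" verb (both are assumptions, not real APIs). */
static void fake_register(struct reg_entry *e, void *addr, size_t len)
{
    e->addr = addr;
    e->len = len;
    e->handle = next_handle++;
    e->invalid = 0;
    e->in_use = true;
    register_calls++;
}

/* Stand-in for the kernel side of scenario 2: the buffer was returned
 * to the OS, so the flag for that registration is set to 1. */
static void fake_kernel_invalidate(void *addr)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].in_use && cache[i].addr == addr)
            cache[i].invalid = 1;
}

/* Return a usable handle for the buffer, or -1 if the cache is full. */
static int get_registration(void *addr, size_t len)
{
    struct reg_entry *free_slot = NULL;

    assert(addr != NULL && len > 0);
    for (int i = 0; i < CACHE_SLOTS; i++) {
        struct reg_entry *e = &cache[i];
        if (!e->in_use) {
            if (free_slot == NULL)
                free_slot = e;
            continue;
        }
        if (e->addr == addr && e->len == len) {
            if (!e->invalid)
                return e->handle;        /* scenario 1: flag still 0 */
            fake_register(e, addr, len); /* scenario 3: flag is 1 */
            return e->handle;
        }
    }
    if (free_slot == NULL)
        return -1;
    fake_register(free_slot, addr, len); /* first use of this buffer */
    return free_slot->handle;
}
```

Note that the hit path is just a load and a compare on the entry's own flag; there is no syscall and no global "something somewhere became invalid" sweep.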
The kernel has to store a little extra state for each registration
(the address of the userspace flag to tweak if the registration ever
becomes invalid), but it's small and bounded by the number of active
registrations.
From MPI's perspective, this feature would be a great step forward --
if we can query verbs at run-time to see if this feature is active, we
can stop using the memory allocation hooks (yay!). Obviously, MPIs
will need to carry the old memory allocation hooks for backwards
compatibility for a while, but if we can effectively deprecate them,
that would be great.
**Specifically: it's the memory allocation hooks code in MPI
implementations that is "fragile", "brittle", etc. Avoiding the issue
would be great; the code becomes much more robust because we're not
subverting the memory allocator.
A secondary feature would be to add memory registration caching to
libibverbs. This wouldn't be *required* for MPIs since we all have
registration caches already, but it might be nice to deprecate/
eventually remove that code in an MPI implementation, too.
The use case is similar to what was proposed earlier: add a flag to
ibv_reg_mr() indicating whether you want the registration cached or
not. If the registration is to be cached, libibverbs would also
invoke the "tell me if this mr ever becomes invalid" functionality.
The MPI/application then *always* calls ibv_reg_mr() to register
memory -- if the cache in libibverbs finds a valid matching mr, it can
just return without a syscall. As also described previously, calls to
ibv_dereg_mr() do not necessarily need to actually unregister -- they
can just mark a cached registration as "able to be evicted if necessary."
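
As a rough illustration of those reg/dereg semantics, here is a self-contained C sketch. The names cached_reg_mr() and cached_dereg_mr() are hypothetical (this is not the libibverbs API and not an API proposal), the "syscalls" counter simply stands in for trips into the kernel, and the "invalid" field is where the per-registration notification would land. A refcount of 0 is how this sketch spells "able to be evicted if necessary."

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical library-side cache entry. */
struct cached_mr {
    void *addr;
    size_t len;
    int refcount;   /* 0 means "able to be evicted if necessary" */
    int invalid;    /* would be set by the invalidation notification */
    int populated;  /* slot currently holds a registration */
};

#define LIB_CACHE_SLOTS 2
static struct cached_mr lib_cache[LIB_CACHE_SLOTS];
static int syscalls = 0;  /* counts simulated kernel round trips */

/* Stand-ins for the real register/unregister work. */
static void do_real_register(struct cached_mr *m, void *addr, size_t len)
{
    m->addr = addr;
    m->len = len;
    m->refcount = 1;
    m->invalid = 0;
    m->populated = 1;
    syscalls++;
}

static void do_real_unregister(struct cached_mr *m)
{
    m->populated = 0;
    syscalls++;
}

/* Always called to register; returns a cached entry when one matches. */
static struct cached_mr *cached_reg_mr(void *addr, size_t len)
{
    struct cached_mr *empty = NULL, *victim = NULL;

    for (int i = 0; i < LIB_CACHE_SLOTS; i++) {
        struct cached_mr *m = &lib_cache[i];
        if (!m->populated) {
            if (empty == NULL)
                empty = m;
            continue;
        }
        if (m->addr == addr && m->len == len && !m->invalid) {
            m->refcount++;  /* cache hit: no syscall */
            return m;
        }
        if (m->refcount == 0 && victim == NULL)
            victim = m;     /* unused entry, may be evicted */
    }
    if (empty == NULL && victim != NULL) {
        do_real_unregister(victim);  /* evict only when space is needed */
        empty = victim;
    }
    if (empty == NULL)
        return NULL;        /* every slot is still in use */
    do_real_register(empty, addr, len);
    return empty;
}

/* Does not unregister; just drops the caller's reference. */
static void cached_dereg_mr(struct cached_mr *m)
{
    assert(m != NULL && m->refcount > 0);
    m->refcount--;
}
```

With a two-slot cache, registering A, deregistering it, and registering A again costs one simulated syscall in total; the actual unregister only happens later, when a third buffer needs the slot.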
The other new verbs discussed in my prior mail would also still be
useful (ibv_is_reg(), ibv_reg_mr_limits(), ibv_reg_mr_clean()).
**Note: the registration caches in MPIs today are not necessarily
that complicated. They're essentially balanced trees (e.g., in OMPI,
it's a red-black tree). This is not the "fragile", "brittle" code --
it's just data structures and accounting.
=================================
I refrained from a specific new API proposal; let's argue over these
ideas first and see if we can come to consensus. If so, specific API
proposals can follow.
--
Jeff Squyres
Cisco Systems