[ofa-general] RE: Demand paging for memory regions

Caitlin Bestler Caitlin.Bestler at neterion.com
Wed Feb 13 12:05:37 PST 2008


I have a few comments on the semantics of memory regions and how they
relate to usage scenarios for memory notifiers and/or page faulting.

First, there is nothing in RDMA semantics that demands that each page
of a memory region be pre-mapped to a physical page before the page
can be advertised remotely.

What is expected is that these advertisements not be at risk. There
has to be an honest expectation that if a 40-page buffer is advertised,
then 40 pages are available to back that advertisement.
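
To make the advertisement concrete, here is a minimal userspace sketch
using libibverbs (assuming a protection domain has already been
allocated; the 40-page figure and the helper name are mine, purely for
illustration). The point is only that ibv_reg_mr() commits the whole
length up front and yields the rkey that gets advertised to the peer:

#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Register an npages-long buffer and return the rkey the peer will use. */
struct ibv_mr *advertise_buffer(struct ibv_pd *pd, size_t npages,
                                void **buf_out, uint32_t *rkey_out)
{
    size_t len = npages * sysconf(_SC_PAGESIZE);
    void *buf;

    if (posix_memalign(&buf, sysconf(_SC_PAGESIZE), len))
        return NULL;

    /* The contract: all 'len' bytes must be backed for as long as the
     * rkey is outstanding, however the implementation arranges it. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (!mr) {
        free(buf);
        return NULL;
    }

    *buf_out = buf;
    *rkey_out = mr->rkey;   /* this is what the peer is told about */
    return mr;
}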

It is simply unacceptable for one end of an RDMA connection to back up
the network because it cannot plan its buffer allocations. Network
retransmission is not a handy spare scratchpad where buffers can be
"cached".

This is somewhat akin to guaranteeing a landing slot for an airplane.
You don't really need to 'pin' the landing resources for the specific
plane for the entire duration of the flight, but you better have more
than just good intentions to make your best effort to find somewhere
for the plane to land when it finally arrives.

When there is no buffer available, the contract behind the
advertisement has been broken. Having failed to meet the requirements,
the receiver should assume that the connection will be torn down. But
there is a little bit of wiggle room here: there is no need to mandate
that the connection MUST be torn down. This was explicitly discussed by
the IETF's RDDP working group while drafting the iWARP RFCs. If there
is a fault, the connection MAY be torn down, but an implementation MAY
take extra steps as part of a fault-tolerance strategy to avoid this.
Dropping a packet and generating a page fault to the host as a
fault-recovery strategy is a legitimate option. But applications MUST
NOT rely on the transport layer providing this service. It's somewhat
like catching divide-by-zero errors: it's nice if the OS, library, or
compiler builds in mechanisms to recover from them, but that does not
mean that applications should go around dividing by zero.
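
From the application's side this means the only safe assumption is that
a placement failure surfaces as a fatal error on the connection. A rough
sketch with libibverbs (the helper name and the tear-down policy are
mine, not anything mandated by the specs):

#include <infiniband/verbs.h>
#include <stdio.h>

/* Poll one completion; treat any error status as fatal for the
 * connection rather than assuming the transport quietly recovered. */
int check_one_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n = ibv_poll_cq(cq, 1, &wc);

    if (n < 0)
        return -1;              /* polling the CQ itself failed */
    if (n == 0)
        return 0;               /* nothing completed yet */

    if (wc.status != IBV_WC_SUCCESS) {
        /* e.g. IBV_WC_REM_ACCESS_ERR when the peer's advertised
         * region could not back our RDMA operation */
        fprintf(stderr, "work request failed: %s\n",
                ibv_wc_status_str(wc.status));
        return -1;              /* caller should tear down / reconnect */
    }
    return 1;
}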

RDMA wire semantics require that a sufficient number of pages be
committed, and that these are the pages as they will be viewed by the
application. There is nothing in the protocol that is inconsistent with
an OS or Hypervisor *substituting* pages in a memory region (as long as
it is done in a way that honors updates to those pages). Great care
must be taken when substituting pages that are DMA accessible, but
substituting pages out from under a running application isn't exactly
trivial either. Virtual Memory Managers (either OS or hypervisor)
should be presumed to understand when they have to preserve the
contents of a page. RDMA presents some special challenges here because
the RDMA layer has no knowledge of the intended usage of tagged memory
buffers, nor does it track the history of access using R-Keys/STags.
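
A purely hypothetical sketch of the ordering a Virtual Memory Manager
would have to honor when substituting a page under a live region -- the
handle and helper functions here do not exist in any current verbs or
kernel API; they only spell out the "preserve the contents" obligation:

#include <stddef.h>
#include <string.h>

/* Hypothetical device-mapping handle and helpers, for illustration only. */
struct mr_page_map;
int hca_block_and_flush_mapping(struct mr_page_map *map, unsigned long va);
int hca_remap_and_resume(struct mr_page_map *map, unsigned long va,
                         void *new_page);

int substitute_page(struct mr_page_map *map, unsigned long va,
                    void *old_page, void *new_page, size_t page_size)
{
    /* 1. Stop the device from placing into the old page.  Anything that
     *    arrives in this window must be dropped, not placed, and left
     *    to wire-level retransmission. */
    if (hca_block_and_flush_mapping(map, va))
        return -1;

    /* 2. Preserve the contents, including any DMA writes that landed
     *    before the flush completed. */
    memcpy(new_page, old_page, page_size);

    /* 3. Point both the CPU page tables and the device translation at
     *    the new physical page, then resume placement. */
    if (hca_remap_and_resume(map, va, new_page))
        return -1;

    return 0;
}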

So the RDMA protocols do allow flexibility in what an R-Key/STag maps
to even while the R-Key or STag is externally advertised. But existing
RDMA verbs have no support for updating the meaning of an R-Key/STag
without first invalidating it. However, that is a verbs/implementation
issue -- not an RDMA wire protocol requirement. New APIs that allow
Virtual Memory Managers to substitute pages in a Memory Region are
feasible and may have valuable use cases, but they need to be
introduced on an evolutionary basis. Existing hardware will not support
them.
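
For completeness, the path that today's verbs do support is the blunt
one: deregister and register again, which hands back a brand new key
that must be re-advertised to the peer. A trimmed sketch (error
handling and the synchronization with the peer are omitted):

#include <infiniband/verbs.h>
#include <stdint.h>

/* "Update" a region the only way existing verbs allow: tear the old
 * registration down and create a new one with a new rkey. The caller
 * must ensure the peer has stopped using the old key first. */
int remap_region(struct ibv_pd *pd, struct ibv_mr **mr,
                 void *new_addr, size_t len, uint32_t *new_rkey)
{
    if (ibv_dereg_mr(*mr))
        return -1;

    *mr = ibv_reg_mr(pd, new_addr, len,
                     IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
    if (!*mr)
        return -1;

    *new_rkey = (*mr)->rkey;    /* the peer must learn this new key */
    return 0;
}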

But as long as such features are not used to enable irresponsible
over-subscription of pages, there is no reason why new devices (or even
sufficiently updatable existing devices) could not support such
concepts. RDMA devices already generate a "fault" when they cannot
place data into host memory. The difference is whether they can be
instructed to drop the packet before acking it rather than terminating
the connection. And the host can respond to the fault either by
terminating the connection or by repairing the problem.
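
To illustrate that choice, here is a hypothetical fault loop on the
host side -- the event structure and helper functions are invented,
since no such verbs interface exists today:

/* Hypothetical placement-fault event and handlers, for illustration only. */
struct placement_fault {
    unsigned long va;       /* faulting virtual address within the MR */
    unsigned int  qp_num;   /* connection that took the fault */
};
int  repair_backing_pages(unsigned long va);
void resume_placement(unsigned int qp_num);
void terminate_connection(unsigned int qp_num);

void handle_placement_fault(struct placement_fault *ev)
{
    if (repair_backing_pages(ev->va) == 0) {
        /* Pages are resident again; the dropped (un-acked) packet will
         * be retransmitted and placed on the next attempt. */
        resume_placement(ev->qp_num);
    } else {
        /* The advertisement cannot be honored after all: fall back to
         * the default behavior and tear the connection down. */
        terminate_connection(ev->qp_num);
    }
}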



