[ofa-general] Re: Demand paging for memory regions

Jason Gunthorpe jgunthorpe at obsidianresearch.com
Tue Feb 12 19:25:33 PST 2008


On Tue, Feb 12, 2008 at 06:35:09PM -0800, Christoph Lameter wrote:
> On Tue, 12 Feb 2008, Jason Gunthorpe wrote:
> 
> > The problem is that the existing wire protocols do not have a
> > provision for doing an 'are you ready' or 'I am not ready' exchange
> > and they are not designed to store page tables on both sides as you
> > propose. The remote side can send RDMA WRITE traffic at any time after
> > the RDMA region is established. The local side must be able to handle
> > it. There is no way to signal that a page is not ready and the remote
> > should not send.
> > 
> > This means the only possible implementation is to stall/discard at the
> > local adaptor when an RDMA WRITE is received for a page that has been
> > reclaimed. This is what leads to deadlock/poor performance..
> 
> You would only use the wire protocols *after* having established the RDMA 
> region. The notifier chains allow an RDMA region (or parts thereof) to be 
> taken down on demand by the VM. The region can be reestablished if one of 
> the sides accesses it. I hope I got that right. Not much exposure to 
> Infiniband so far.

[clip explanation]

But this isn't how IB or iwarp work at all. What you describe is a
significant change to the general RDMA operation and requires changes to
both sides of the connection and the wire protocol.

A few comments on RDMA operation that might clarify things a little
bit more:
 - In RDMA (iwarp and IB versions) the hardware page tables exist to
   linearize the local memory so the remote does not need to be aware
   of non-linearities in the physical address space. The main
   motivation for this is kernel bypass where the user space app wants
   to instruct the remote side to DMA into memory using user space
   addresses. Hardware provides the page tables to switch from
   incoming user space virtual addresses to physical addresses.

   This greatly simplifies the user space programming model since you
   don't need to pass around or create s/g lists for memory that is
   already virtually contiguous.

   Many kernel RDMA drivers (SCSI, NFS) only use the HW page tables
   for access control and enforcing the lifetime of the mapping.

   The page tables in the RDMA hardware exist primarily to support
   this, and not for other reasons. The pinning of pages is partly to
   support the HW page tables and partly to support the RDMA lifetime
   rules; it is the lifetime rules that cause problems for the VM. (A
   minimal registration sketch follows this list.)
 - The wire protocol consists of packets that say 'Write XXX bytes to
   offset YY in Region RRR'. Creating a region produces the RRR label
   and currently pins the pages. So long as the RRR label is valid the
   remote side can issue write packets at any time without any
   further synchronization. There are no wire-level events associated
   with creating RRR. You can pass RRR to the other machine in any
   fashion, even using carrier pigeons :) (A write-post sketch also
   follows this list.)
 - The RDMA layer is very general (ala TCP); useful protocols (like
   SCSI) are built on top of it, and they specify the lifetime rules
   and the protocol for exchanging RRR.

   Every protocol is different. In-kernel protocols like SRP and NFS
   RDMA seem to have very short lifetimes for RRR and work more like
   pci_map_* does with real SCSI hardware.
 - HPC userspace apps, like MPI apps, have different lifetime rules
   and tend to be really long lived. These users will not want
   anything that makes their ops more expensive and probably don't
   care too much about the VM problems you are looking at (?)
 - There is no protocol support to exchange RRR. This is all done
   by upper level protocols (ala HTTP vs TCP). You cannot assert
   and revoke RRR in a general way. Every protocol is different
   and optimized.

   This is your step 'A will then send a message to B notifying..'.
   It simply does not exist in the protocol specifications.
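
To make the registration step concrete, here is a minimal sketch (not
from this thread, just an illustrative libibverbs fragment) of how a
userspace app creates the RRR handle described above. The helper name
register_region and the protection domain pd (from ibv_alloc_pd())
are assumptions for the example; ibv_reg_mr() is the call that pins
the pages and programs the HCA page table for the virtually
contiguous buffer:

  /* Illustrative sketch only; error handling trimmed. */
  #include <infiniband/verbs.h>
  #include <stdlib.h>

  static struct ibv_mr *register_region(struct ibv_pd *pd, size_t len)
  {
          void *buf = malloc(len);   /* virtually contiguous buffer */
          if (!buf)
                  return NULL;

          /* Pins the pages and builds the HCA page table so the remote
           * side can address the buffer via (rkey, address, offset). */
          struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                         IBV_ACCESS_LOCAL_WRITE |
                                         IBV_ACCESS_REMOTE_WRITE);
          if (!mr)
                  free(buf);
          /* mr->rkey is the 'RRR' label handed to the remote side by
           * whatever means the upper level protocol chooses. */
          return mr;
  }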
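
And a similar sketch for the write side: once the initiator holds the
remote region's base address and rkey (remote_addr and remote_rkey
below, assumed to have been exchanged by some upper level protocol),
it can post 'write len bytes at offset into region RRR' at any time;
nothing in the operation lets the target say 'not ready':

  /* Illustrative sketch only; assumes a connected QP and a local MR. */
  #include <infiniband/verbs.h>
  #include <stdint.h>

  static int rdma_write(struct ibv_qp *qp, struct ibv_mr *local_mr,
                        uint64_t remote_addr, uint32_t remote_rkey,
                        uint64_t offset, uint32_t len)
  {
          struct ibv_sge sge = {
                  .addr   = (uintptr_t)local_mr->addr,
                  .length = len,
                  .lkey   = local_mr->lkey,
          };
          struct ibv_send_wr wr = {
                  .sg_list    = &sge,
                  .num_sge    = 1,
                  .opcode     = IBV_WR_RDMA_WRITE,
                  .send_flags = IBV_SEND_SIGNALED,
                  .wr.rdma.remote_addr = remote_addr + offset,
                  .wr.rdma.rkey        = remote_rkey,
          };
          struct ibv_send_wr *bad_wr;

          /* No handshake with the target happens here; its HCA must
           * accept the incoming write whenever it arrives. */
          return ibv_post_send(qp, &wr, &bad_wr);
  }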

I don't know much about Quadrics, but I would be hesitant to lump it
in too much with these RDMA semantics. Christian's comments sound like
they operate closer to what you described and that is why they have an
existing patch set. I don't know :)

What it boils down to is that to implement true removal of pages in a
general way the kernel and HCA must either drop packets or stall
incoming packets; both are big performance problems, and I can't see
many users wanting this. Enterprise-style users of SCSI, NFS, etc.
already have short pin periods, and HPC MPI users probably won't care
about the VM issues enough to warrant the performance overhead.

Regards,
Jason


