[openib-general] Getting rid of pinned memory requirement

Troy Benjegerdes hozer at hozed.org
Mon Mar 14 17:06:19 PST 2005


On Mon, Mar 14, 2005 at 04:33:19PM -0800, Caitlin Bestler wrote:
> > 
> > While hardware designers may like this idea, I would like to 
> > make the point that if you want the application to 
> > *absolutely* control the availability of physical memory, you 
> > shouldn't be writing userspace applications that run on Linux.
> > 
> 
> This is not just a hardware design issue. It is fundamental to
> why RDMA is able to optimize end-to-end traffic flow. The application
> is directly advertising the availability of buffers (through RKeys)
> to the other side. It is bad network engineering for the kernel
> to revoke that good faith advertisement and count on the HCA/RNIC
> to say "oops" when the data does arrive but the targeted buffer
> is not in memory.
> 
> But that does not mean that you cannot design mechanisms below
> the application to allow the kernel to re-organize physical
> memory -- it just means that the kernel had best not be playing
> overcommit tricks behind the application's back.
> 
> To use a banking analogy, an advertised RKey is like a certified
> check. The application has sent this RKey to its peer, and it
> expects the HCA/RNIC to honor that check when RDMA Writes are
> made to that memory.  But just as a bank does not have to 
> guarantee in advance which specific bills will be used to
> cash a certified check, there is nothing to say that the
> virtual to physical mappings are permanent and immutable.
> 
> It would be possible to design an interface that allowed
> the kernel to:
> 
> a)	suspend the use of a memory region.
> 		1) outputs referencing the suspended LKey would be
> 			temporarily held by the HCA/RNIC.
> 		2) inputs referencing the suspended memory region
> 			would be delayed (RNR NAK, internal buffers,
> 			etc.)
> 		3) possibly ask the peer to similarly suspend
> 			sending. This is trickier though.
> b)	Update the virtual to physical mappings, or at least
> 	provide the RDMA layer with "physical page X replaced
> 	by physical page Y".
> c)	unsuspend the memory region.
> 
> The key is that the entire operation either has to be fast
> enough so that no connection or application session layer
> time-outs occur, or an end-to-end agreement to suspend the
> connection is a requirement. The first option seems more
> plausible to me; the second essentially requires extending
> the CM protocol. That's a tall order even for InfiniBand,
> and it's even worse for iWARP where the CM functionality
> typically ends when the connection is established.
 
I'll buy the good network design argument.

I suppose if the kernel wants to revoke a card's pinned memory, we
should be able to guarantee that it gets new pinned memory within a
bounded time. What sort of timing do we need? Milliseconds?
Microseconds?

In the case of iWARP, isn't this just TCP underneath? If so, can't we
just drop any packets in the pipe on the floor and let them get
retransmitted? (I suppose the same argument goes for InfiniBand..
what sort of time window do we have for retransmission?)

What are the limits on end-to-end flow control in IB and iWARP?
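
To make a)/b)/c) above concrete, here's roughly the kernel-facing
interface I picture. Every name below is made up for illustration;
nothing like this exists in the verbs stack today:

#include <stdint.h>

struct rdma_mr;			/* opaque handle for a registered region */

struct rdma_page_swap {
	uint64_t old_phys;	/* physical page being vacated */
	uint64_t new_phys;	/* replacement page, contents already copied */
};

/*
 * a) Quiesce the region: the HCA/RNIC holds outbound work requests
 *    referencing the LKey and stalls inbound traffic targeting the
 *    RKey (RNR NAK, internal buffering, etc.).
 */
int rdma_mr_suspend(struct rdma_mr *mr);

/*
 * b) Patch the virtual-to-physical translations: "physical page X
 *    replaced by physical page Y", one entry per migrated page.
 */
int rdma_mr_remap(struct rdma_mr *mr,
		  const struct rdma_page_swap *swaps, int nswaps);

/*
 * c) Unsuspend: held sends drain, RNR-NAKed peers retry, and the
 *    advertised RKey is honored without the peer ever noticing.
 */
int rdma_mr_resume(struct rdma_mr *mr);

The whole suspend/remap/resume sequence would have to finish inside
the peer's retry budget, which is what my timing question above is
really asking about.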

> > There's always going to be a limit on how much memory you can 
> > mlock. And right now the only option the kernel has for 
> > unlocking that memory is to kill the application. I think 
> > there's got to be a reasonable way to deal with this that 
> > doesn't make the application responsible for everything in 
> > the world. We don't want to have to rewrite every RDMA 
> > application to be able to support memory hotplug. This is an 
> > obvious layer that can and should be abstracted by the kernel.
> >
> 
> Yes, there are limits on how much memory you can mlock, or
> even allocate. Applications are required to register memory
> precisely because the required guarantees are not there by
> default. Eliminating those guarantees *is* effectively
> rewriting every RDMA application without even letting
> them know.

Some of this argument is a policy issue, which I would argue shouldn't
be hard-coded in software or baked into the network hardware.

At least in my view, the guarantees are only there to make applications
go fast. We are getting low latency and high performance with InfiniBand
by making memory registration go really, really slow. If, to make big HPC
simulation applications work, we wind up doing memcpy() to put the data
into a pre-registered buffer because we can't register half of physical
memory, the application isn't going very fast.
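
For what it's worth, here's the memcpy() fallback I mean, sketched
against the gen2 libibverbs API (the bounce-buffer names are mine;
error handling and the actual work-request posting are omitted):

#include <stdlib.h>
#include <string.h>
#include <infiniband/verbs.h>

#define BOUNCE_SIZE (4UL * 1024 * 1024)

static void *bounce;
static struct ibv_mr *bounce_mr;

/* Registration is slow, so pay for it exactly once, up front. */
int bounce_init(struct ibv_pd *pd)
{
	bounce = malloc(BOUNCE_SIZE);
	if (!bounce)
		return -1;
	bounce_mr = ibv_reg_mr(pd, bounce, BOUNCE_SIZE,
			       IBV_ACCESS_LOCAL_WRITE);
	return bounce_mr ? 0 : -1;
}

/*
 * Per-transfer cost: one memcpy() into the pre-registered buffer.
 * When the working set is half of physical memory, this copy is
 * where the "RDMA is fast" win goes to die.
 */
void bounce_stage(const void *app_data, size_t len)
{
	memcpy(bounce, app_data, len);
	/* ... post a send / RDMA write using bounce_mr->lkey ... */
}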


