[openib-general] Getting rid of pinned memory requirement

Caitlin Bestler caitlinb at siliquent.com
Mon Mar 14 16:33:19 PST 2005


> 
> While hardware designers may like this idea, I would like to 
> make the point that if you want the application to 
> *absolutely* control the availability of physical memory, you 
> shouldn't be writing userspace applications that run on Linux.
> 

This is not just a hardware design issue. It is fundamental to
why RDMA is able to optimize end-to-end traffic flow. The application
is directly advertising the availability of buffers (through RKeys)
to the other side. It is bad network engineering for the kernel
to revoke that good-faith advertisement and count on the HCA/RNIC
to say "oops" when the data does arrive but the targeted buffer
is not in memory.
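
For concreteness, here is a minimal sketch of that advertisement
using the libibverbs memory-registration call. The protection
domain pd is assumed to already exist, send_rkey_to_peer() is just
a placeholder for whatever out-of-band exchange the application
uses, and error handling is kept to the bare minimum:

/*
 * Minimal sketch: register a buffer so the HCA can honor remote
 * access to it, then hand the rkey to the peer.
 */
#include <stdint.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

extern void send_rkey_to_peer(uint64_t addr, uint32_t rkey, size_t len);

struct ibv_mr *advertise_buffer(struct ibv_pd *pd, size_t len)
{
	struct ibv_mr *mr;
	void *buf = malloc(len);

	if (!buf)
		return NULL;

	/* Registration pins the pages and produces the lkey/rkey that
	 * every later RDMA operation on this buffer is checked against. */
	mr = ibv_reg_mr(pd, buf, len,
			IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
	if (!mr) {
		free(buf);
		return NULL;
	}

	/* From this point on the peer may RDMA Write into (addr, rkey)
	 * at any time -- that is the good-faith advertisement. */
	send_rkey_to_peer((uintptr_t)buf, mr->rkey, len);
	return mr;
}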

But that does not mean that you cannot design mechanisms below
the application to allow the kernel to reorganize physical
memory -- it just means that the kernel had best not be playing
overcommit tricks behind the application's back.

To use a banking analogy, an advertised RKey is like a certified
check. The application has sent this RKey to its peer, and it
expects the HCA/RNIC to honor that check when RDMA Writes are
made to that memory. But just as a bank does not have to
guarantee in advance which specific bills will be used to
cash a certified check, there is nothing to say that the
virtual-to-physical mappings are permanent and immutable.

It would be possible to design an interface that allowed
the kernel to (a rough sketch of such calls follows the list):

a)	suspend the use of a memory region.
		1) outputs referencing the suspended LKey would be
			temporarily held by the HCA/RNIC.
		2) inputs referencing the suspended memory region
			would be delayed (RNR NAK, internal buffers,
			etc.)
		3) possibly ask the peer to similarly suspend
			sending, though this is trickier.
b)	update the virtual-to-physical mappings, or at least
	provide the RDMA layer with "physical page X replaced
	by physical page Y".
c)	unsuspend the memory region.
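
None of this exists in the verbs interface today; purely as an
illustration, steps (a)-(c) might look roughly like the following,
where every name (rdma_mr_suspend, rdma_mr_remap_page,
rdma_mr_resume) is invented for this sketch:

/*
 * Hypothetical sketch only -- these calls do not exist in any
 * current verbs API.  They illustrate steps (a)-(c) above.
 */
#include <stdint.h>

struct rdma_mr;		/* opaque registered memory region */

/* (a) Quiesce the region: the HCA/RNIC holds outbound work requests
 *     referencing its LKey and stalls inbound traffic targeting its
 *     RKey (RNR NAK, internal buffering, and so on). */
int rdma_mr_suspend(struct rdma_mr *mr);

/* (b) Tell the RDMA layer that one physical page backing the region
 *     has moved, without changing the virtual address or the keys. */
int rdma_mr_remap_page(struct rdma_mr *mr,
		       uint64_t old_phys, uint64_t new_phys);

/* (c) Resume normal operation; held work requests drain first. */
int rdma_mr_resume(struct rdma_mr *mr);

The kernel's page-migration path would then be: suspend, copy the
page contents to the new page, remap, resume.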

The key is that the entire operation either has to be fast
enough that no connection or application session-layer
time-outs occur, or an end-to-end agreement to suspend the
connection becomes a requirement. The first option seems more
plausible to me; the second essentially requires extending
the CM protocol. That's a tall order even for InfiniBand,
and it's even worse for iWARP, where the CM functionality
typically ends once the connection is established.



> There's always going to be a limit on how much memory you can 
> mlock. And right now the only option the kernel has for 
> unlocking that memory is to kill the application. I think 
> there's got to be a reasonable way to deal with this that 
> doesn't make the application responsible for everything in 
> the world. We don't want to have to rewrite every RDMA 
> application to be able to support memory hotplug. This is an 
> obvious layer that can and should be abstracted by the kernel.
>

Yes, there are limits on how much memory you can mlock, or
even allocate. Applications are required to register memory
precisely because the required guarantees are not there by
default. Eliminating those guarantees *is* effectively
rewriting every RDMA application without even letting
them know.
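
On the first point, the ceiling an application is registering
against is at least visible from user space through
RLIMIT_MEMLOCK (as I understand it, registered memory is pinned
and accounted against that limit in the Linux stack). A trivial,
illustrative check:

/* Illustrative only: print the mlock ceiling that pinned
 * registrations are accounted against.  Nothing here raises it. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
		perror("getrlimit");
		return 1;
	}

	/* rlim_cur is the enforced soft limit; pinning more than
	 * this will be refused. */
	printf("lockable memory: soft %llu, hard %llu bytes\n",
	       (unsigned long long)rl.rlim_cur,
	       (unsigned long long)rl.rlim_max);
	return 0;
}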
 


