[openib-general] mthca FMR correctness (and memory windows)

Talpey, Thomas Thomas.Talpey at netapp.com
Mon Mar 20 14:34:38 PST 2006


At 05:09 PM 3/20/2006, Dror Goldenberg wrote:
>It's not exactly the same. The important difference is about
>scatter/gather.
>If you use dma_mr, then you have to send a chunk list from the client to
>the server. Then, for each one of the chunks, the server has to post an
>RDMA read or write WQE. Also, the typical message size on the wire
>will be a page (I am assuming large IOs for the purpose of this
>discussion).

Yes, of course that is a consideration. The RPC/RDMA protocol carries
many more "chunks" for NFS_READ and NFS_WRITE RPCs in this mode.
But the performance is still excellent, because the server can stream
RDMA Writes and/or RDMA Reads to and from the chunklists in response.
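
Roughly, in verbs terms, that streaming amounts to one RDMA Write WQE per
client-provided chunk (handle = rkey, offset = target address), signaled
only on the last one. A minimal sketch against the kernel ib_verbs posting
interface; the helper name and the rpcrdma_chunk structure are purely
illustrative (the real NFS/RDMA server code is more involved), and error
handling and flow control are omitted:

#include <linux/string.h>
#include <rdma/ib_verbs.h>

/* Illustrative stand-in for one RPC/RDMA chunk list entry. */
struct rpcrdma_chunk {
	u32 handle;		/* client's rkey for this chunk  */
	u32 length;		/* bytes to write into the chunk */
	u64 target_offset;	/* client's advertised address   */
};

/* Sketch: stream one RDMA Write per chunk; the RPC reply Send
 * follows separately. */
static int stream_write_chunks(struct ib_qp *qp, struct ib_mr *mr,
			       u64 src_dma, struct rpcrdma_chunk *chunk,
			       int nchunks)
{
	struct ib_send_wr wr, *bad_wr;
	struct ib_sge sge;
	int i, ret = 0;

	for (i = 0; i < nchunks; i++) {
		memset(&wr, 0, sizeof wr);

		sge.addr   = src_dma;		/* server-side source data */
		sge.length = chunk[i].length;
		sge.lkey   = mr->lkey;

		wr.opcode              = IB_WR_RDMA_WRITE;
		wr.sg_list             = &sge;
		wr.num_sge             = 1;
		wr.wr.rdma.remote_addr = chunk[i].target_offset;
		wr.wr.rdma.rkey        = chunk[i].handle;
		/* signal only the last Write to keep completions cheap */
		wr.send_flags = (i == nchunks - 1) ? IB_SEND_SIGNALED : 0;

		ret = ib_post_send(qp, &wr, &bad_wr);
		if (ret)
			break;
		src_dma += chunk[i].length;
	}
	return ret;
}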

Since NFS clients typically use 32KB or 64KB I/O sizes, such chunklists
typically have 8 or 16 elements, for which the client offers large numbers
of RDMA Read responder resources, along with large numbers of RPC/RDMA
operation credits. In a typical read or write burst, I have seen the
Linux client have 10 or 20 RPC operations outstanding, each with
8 or 16 RDMA operations and two sends for the request/response.
In full transactional workloads, I have seen over a hundred RPCs.
It's pretty impressive on an analyzer.
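
The arithmetic behind those numbers, spelled out (assuming 4KB pages and
page-sized chunk elements):

/*  32KB I/O: 32768 / 4096 =  8 chunk elements
 *  64KB I/O: 65536 / 4096 = 16 chunk elements
 *  10-20 RPCs in flight x (8-16 RDMA ops + 2 Sends each) is on the
 *  order of 100-360 work requests outstanding at the client.
 */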

>Alternatively, if you use FMR, you can take the list of pages the IO is
>composed of, collapse them into a virtually contiguous memory region,
>and use just one chunk for the IO.
>This:
>- Reduces the number of WQEs that need to be posted per IO operation
>	* lower CPU utilization
>- Reduces the number of messages on the wire and increases their size
>	* better HCA performance
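
For concreteness, that FMR path maps roughly onto the kernel verbs FMR
interface as sketched below. The helper name and the dma_pages array are
illustrative, error handling is omitted, and a real consumer would normally
go through an FMR pool and batch the unmaps rather than mapping per I/O:

#include <rdma/ib_verbs.h>

/* Sketch: collapse an I/O's scattered pages into one virtually
 * contiguous region, so a single {rkey, iova, length} chunk can be
 * advertised for the whole I/O.  "dma_pages" holds the DMA address
 * of each page making up the I/O. */
static int map_io_as_one_chunk(struct ib_fmr *fmr, u64 *dma_pages,
			       int npages, u64 iova, u32 *rkey_out)
{
	int ret;

	ret = ib_map_phys_fmr(fmr, dma_pages, npages, iova);
	if (ret)
		return ret;

	*rkey_out = fmr->rkey;	/* the one handle to advertise */
	return 0;
}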

It's all relative! And most definitely not a zero-sum game. Another way
of looking at it:

If the only way to get fewer messages is to incur more client overhead,
it's (probably) a bad trade. Besides, we're nowhere near the op rate of
your HCA with most storage workloads. So it's an even better strategy
to just put the work on the wire asap. Then, the throughput simply
scales (rises) with demand.

This, by the way, is why the fencing behavior of memory windows is so
painful. I would much rather take an interrupt on bind completion than
fence the entire send queue. But there isn't a standard way to do that,
even in iWARP. Sigh.
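
To illustrate the two alternatives in verbs terms (a sketch only, for HCAs
that implement memory windows at all; the ib_bind_mw()/ib_post_send() calls
and the IB_SEND_FENCE/IB_SEND_SIGNALED flags are the standard kernel
interface, everything else is illustrative and assumes the ib_mw_bind
attributes have already been filled in):

#include <rdma/ib_verbs.h>

/* Option 1: post the advertising Send right behind the bind with the
 * fence bit set.  This is the "fence the entire send queue" behavior
 * described above. */
static int advertise_with_fence(struct ib_qp *qp, struct ib_mw *mw,
				struct ib_mw_bind *bind,
				struct ib_send_wr *rpc_send)
{
	struct ib_send_wr *bad_wr;
	int ret;

	ret = ib_bind_mw(qp, mw, bind);
	if (ret)
		return ret;

	rpc_send->send_flags |= IB_SEND_FENCE;	/* holds up the queue */
	return ib_post_send(qp, rpc_send, &bad_wr);
}

/* Option 2: signal the bind and take the completion interrupt; post
 * the advertising Send only after the bind's CQE is reaped, so nothing
 * else on the send queue is held up. */
static int bind_then_wait_for_cqe(struct ib_qp *qp, struct ib_mw *mw,
				  struct ib_mw_bind *bind)
{
	bind->send_flags |= IB_SEND_SIGNALED;
	return ib_bind_mw(qp, mw, bind);
	/* ...later, when ib_poll_cq() returns this bind's wr_id, post
	 * the Send that advertises mw->rkey. */
}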

Tom.



