[ofiwg] Feedback about libfabric related to block storage

Tue Oct 7 05:44:58 PDT 2014

Hello Paul and Sean,

Last April, during the InfiniBand Developers Workshop, I promised to
provide feedback about what could be improved in the RDMA API from the
point of view of a block storage driver implementor. Sorry that it took
so long before I provided feedback. My comments about the current RDMA
API are as follows:
* Several APIs in the Linux kernel pass a scatterlist (struct
scatterlist) to a block driver or SCSI LLD. Such a scatterlist consists
of one or more discontiguous memory regions. As an example, the data associated
with a READ or WRITE request is passed as a scatterlist to block
drivers. However, all RDMA memory registration primitives I am familiar
with support registration of a single contiguous virtual memory region.
It is not always possible to map a Linux scatterlist onto a single
contiguous virtual memory region. Some RDMA APIs, e.g. the recently
added APIs for T10-PI, accept only a single memory key. Similarly, the
header of certain protocols, e.g. iSER, has room for only one memory
key. This is why block storage initiator drivers that use the RDMA API
contain code for copying discontiguous scatterlists into a contiguous
buffer. I see this as a mismatch between the capabilities
of the Linux kernel and the RDMA API. My proposal is to address this by
modifying the RDMA API such that registration of discontiguous memory
regions via a single memory key becomes possible. This will eliminate
the need for data copying in RDMA block storage drivers.
* Ensuring that a block driver processes I/O requests with minimal
response time when the queue depth is low and with optimal bandwidth
when the queue depth is high requires processing most of each I/O
request in atomic (interrupt or tasklet) context at low queue depth and
processing I/O requests via a multithreaded pipeline at high queue depth.
Dynamically switching between these two modes is only possible without
disabling interrupts for an unacceptably long time if ib_req_notify_cq()
always returns a value <= 0. This is because most storage protocols
require that RDMA completions are processed in order. If
ib_req_notify_cq() can be called both from atomic context and from
thread context and if an ib_req_notify_cq() call from thread context
returns a value > 0 then it is not possible to guarantee that RDMA
completions are processed in order without adding IRQ-safe locking
around each ib_req_notify_cq() call. Hence my request to modify the
behavior of ib_req_notify_cq() such that it always returns a value <= 0.
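
To make the first point more concrete, here is a minimal userspace
sketch of the kind of copy that a single-memory-key restriction forces
on an initiator driver. It is not taken from any existing driver:
struct sg_seg and sg_gather() are hypothetical names that stand in for
struct scatterlist and the driver's bounce-buffer copy, and plain
memcpy() stands in for whatever mapping and copying a kernel driver
would actually have to do.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one scatterlist element: one memory region. */
struct sg_seg {
	const void *addr;	/* start of the region */
	size_t len;		/* length of the region in bytes */
};

/*
 * Copy nseg discontiguous segments back to back into one contiguous
 * buffer, so that the result could be registered under a single key.
 * Returns 0 and stores the number of bytes copied in *total on success.
 * Returns -1 if the buffer is too small; *total is left unchanged then.
 */
static int sg_gather(const struct sg_seg *sg, size_t nseg,
		     void *buf, size_t buflen, size_t *total)
{
	size_t off = 0;
	size_t i;

	assert(sg != NULL || nseg == 0);

	for (i = 0; i < nseg; i++) {
		/* off never exceeds buflen, so this cannot underflow. */
		if (sg[i].len > buflen - off)
			return -1;
		memcpy((char *)buf + off, sg[i].addr, sg[i].len);
		off += sg[i].len;
	}
	*total = off;
	return 0;
}
```

Every byte of every request passes through this loop once. If a single
memory key could describe all segments, the copy and the bounce buffer
would no longer be needed.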

I am aware that firmware changes may be required in order to realize
these API changes.

Best regards,

Bart.
