[ofiwg] Feedback about libfabric related to block storage

Tue Oct 7 06:01:19 PDT 2014

On 10/7/2014 3:44 PM, Bart Van Assche wrote:
> Hello Paul and Sean,
>
> Last April, during the InfiniBand Developers Workshop, I promised to
> provide feedback about what could be improved in the RDMA API from the
> point of view of a block storage driver implementor. Sorry that it took
> so long before I provided feedback. My comments about the current RDMA
> API are as follows:
> * Several APIs in the Linux kernel pass a scatterlist to a block driver
> or SCSI LLD (struct scatterlist). Such a scatterlist consists of one or
> more discontiguous memory regions. As an example, the data associated
> with a READ or WRITE request is passed as a scatterlist to block
> drivers. However, all RDMA memory registration primitives I am familiar
> with support registration of a single contiguous virtual memory region.
> It is not always possible to map a Linux scatterlist onto a single
> contiguous virtual memory region. Some RDMA APIs, e.g. the recently
> added APIs for T10-PI, only accept a single memory key. Similarly, in
> the header of certain protocols, e.g. iSER, only one memory key can be
> stored. Hence the presence of code for copying discontiguous
> scatterlists into a contiguous buffer in block storage initiator drivers
> that use the RDMA API. I see this as a mismatch between the capabilities
> of the Linux kernel and the RDMA API. My proposal is to address this by
> modifying the RDMA API such that registration of discontiguous memory
> regions via a single memory key becomes possible. This will eliminate
> the need for data copying in RDMA block storage drivers.

Funny, I have a patchset in the pipeline that introduces a new registration
type called "Indirect memory registration", which addresses exactly that!

This depends on device capability, of course (mlx5 and later support
it)...

Allowing this kind of thing on existing devices seems infeasible to
me, as device page tables work at block granularity. Compensating for
that transparently to the user in the device driver would require
maintaining several mkeys per region internally, which is never going
to scale.
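
To make the block-granularity point concrete, here is a rough user-space
sketch (not driver code). It assumes the usual page-list constraint: within
one key, every element except the first must start on a page boundary and
every element except the last must end on one. The names `struct sg_seg`
and `mkeys_needed()` are made up for illustration and do not exist in the
kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one DMA-mapped scatterlist element. */
struct sg_seg {
    uint64_t addr; /* DMA address of the first byte */
    uint32_t len;  /* length in bytes */
};

/*
 * Count how many page-list memory keys would be needed to cover the
 * list. A new key is started at every "gap": a boundary between two
 * elements where the previous element does not end on a page boundary
 * or the next one does not start on one. Elements that are directly
 * adjacent in address space are treated as one run.
 */
static size_t mkeys_needed(const struct sg_seg *sg, size_t n,
                           uint64_t page_size)
{
    uint64_t mask;
    size_t keys, i;

    /* page_size must be a non-zero power of two */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    if (n == 0)
        return 0;

    mask = page_size - 1;
    keys = 1;
    for (i = 1; i < n; i++) {
        uint64_t prev_end = sg[i - 1].addr + sg[i - 1].len;

        if (prev_end == sg[i].addr)
            continue; /* contiguous with the previous element */
        if ((prev_end & mask) || (sg[i].addr & mask))
            keys++;   /* gap: a page list cannot describe this */
    }
    return keys;
}
```

With 4 KiB pages, a list of page-aligned, page-sized elements needs one
key, while a list of 512-byte elements scattered over different pages
needs one key per element, which is the scaling problem described above.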

Let me refurbish my set a bit and post it on linux-rdma.

> * Ensuring that a block driver processes I/O requests with a minimal
> response time if the queue depth is low and with optimal bandwidth if
> the queue depth is high requires processing most of the I/O request in
> atomic (interrupt or tasklet) context for low queue depth and processing
> I/O requests via a multithreaded pipeline if the queue depth is high.
> Dynamically switching between these two modes is only possible without
> disabling interrupts for an unacceptably long time if ib_req_notify_cq()
> always returns a value <= 0. This is because most storage protocols
> require that RDMA completions are processed in order. If
> ib_req_notify_cq() can be called both from atomic context and from
> thread context and if an ib_req_notify_cq() call from thread context
> returns a value > 0 then it is not possible to guarantee that RDMA
> completions are processed in order without adding IRQ-safe locking
> around each ib_req_notify_cq() call. Hence the request to modify the
> behavior of ib_req_notify_cq() such that it always returns a value <= 0.
>

For mlx devices this is the case... I can't comment on other devices,
though. I also support this request.
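
For context, a positive return value forces the caller into a
drain-and-rearm loop like the one sketched below. This is a user-space
mock, not kernel code: `struct mock_cq` and the `mock_*()` helpers are
hypothetical and only mirror the return convention of ib_req_notify_cq()
when IB_CQ_REPORT_MISSED_EVENTS is requested (> 0 means completions may
have been missed and the caller has to poll again).

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_CQ_DEPTH 16

/* Hypothetical single-threaded model of a completion queue. */
struct mock_cq {
    int entries[MOCK_CQ_DEPTH]; /* completion IDs in arrival order */
    size_t head, tail;          /* consume at head, produce at tail */
    int armed;                  /* set once notification is requested */
    int has_late, late_id;      /* completion that races with arming */
};

static void mock_complete(struct mock_cq *cq, int id)
{
    assert(cq->tail < MOCK_CQ_DEPTH);
    cq->entries[cq->tail++] = id;
}

/* Returns 1 and stores one completion ID, or 0 if the queue is empty. */
static int mock_poll_cq(struct mock_cq *cq, int *id)
{
    if (cq->head == cq->tail)
        return 0;
    *id = cq->entries[cq->head++];
    return 1;
}

/* Arms the queue; returns > 0 if completions are already pending. */
static int mock_req_notify_cq(struct mock_cq *cq)
{
    if (cq->has_late) { /* simulate a completion racing with the arm */
        cq->has_late = 0;
        mock_complete(cq, cq->late_id);
    }
    cq->armed = 1;
    return cq->head != cq->tail;
}

/*
 * Drain the queue, re-arm, and poll again whenever arming reports
 * missed completions. Stops without re-arming if 'out' fills up.
 */
static size_t drain_and_rearm(struct mock_cq *cq, int *out, size_t max)
{
    size_t n = 0;
    int id;

    do {
        while (n < max && mock_poll_cq(cq, &id))
            out[n++] = id;
    } while (n < max && mock_req_notify_cq(cq) > 0);
    return n;
}
```

The second pass of the outer loop runs in whatever context did the
re-arm. If both an interrupt handler and a thread can reach that point,
they can poll the same queue concurrently, and that is where the ordering
concern comes from.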

Sagi.