[ofa-general] Directions for verbs API extensions

Roland Dreier rdreier at cisco.com
Sat Apr 5 21:41:02 PDT 2008


Here is a little document I wrote trying to summarize all the things
that we might want to add to the verbs API to support device
capabilities that aren't exposed yet.  There are a number of issues to
resolve, and answers to the questions I ask below would help us make
progress towards actually supporting all this.

There are a number of verbs that are common to the iWARP/RDMA
consortium verbs and the InfiniBand base memory management extensions
(IB-BMME).  We would probably add one device capability bit for "BMME"
(and all iWARP devices could set it) to show support for everything here:

 - Allocate L_Key/STag.  This allocates MR resources without actually
   registering memory; the MR can then be registered or invalidated as
   described below.

 - "Fast register" memory through send queue.  This allows a work
   request to be posted to a send queue to register memory using an
   L_Key/STag that is in the invalid state.

 - Local invalidate send work requests, which can be used to
   invalidate an MR or MW.  One subtle point here is that local
   invalidate operations are very loosely ordered, in the sense that
   they can be executed before earlier requests.  Support for fencing
   local invalidate operations is mandatory in iWARP but only
   optional in IB.  Is there any existing IB device that supports
   BMME but doesn't support local invalidate fencing?  I really hope
   we can ignore this possibility.

 - Memory windows associated with a single QP and bound using send
   work requests posted with the normal post send verb rather than a
   separate MW verb.  (See below for more.)
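
To make the allocate / fast register / local invalidate life cycle
above concrete, here is a minimal sketch of the key states as a toy
model.  Everything in it (the model_* names, the struct layout, and
the choice to reject an invalidate of an already-invalid key) is a
hypothetical illustration, not a proposed API and not taken from
either spec:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: tracks only whether an L_Key/STag is usable. */
enum model_mr_state { MODEL_MR_INVALID, MODEL_MR_VALID };

struct model_mr {
    uint32_t lkey;              /* key handed out at allocation time */
    enum model_mr_state state;
    uint64_t addr;              /* meaningful only while VALID */
    uint64_t length;            /* meaningful only while VALID */
};

/* "Allocate L_Key/STag": reserve the key without registering memory. */
static void model_alloc_lkey(struct model_mr *mr, uint32_t lkey)
{
    mr->lkey = lkey;
    mr->state = MODEL_MR_INVALID;
    mr->addr = 0;
    mr->length = 0;
}

/* "Fast register": only allowed on a key that is in the invalid state. */
static bool model_fast_register(struct model_mr *mr, uint64_t addr,
                                uint64_t length)
{
    if (mr->state != MODEL_MR_INVALID)
        return false;
    mr->addr = addr;
    mr->length = length;
    mr->state = MODEL_MR_VALID;
    return true;
}

/* "Local invalidate": return the key to the invalid state for reuse.
 * Rejecting an already-invalid key is a modeling choice made here. */
static bool model_local_invalidate(struct model_mr *mr)
{
    if (mr->state != MODEL_MR_VALID)
        return false;
    mr->state = MODEL_MR_INVALID;
    return true;
}
```

In a real device the last two steps would be work requests posted to
a send queue, so the ordering and fencing questions raised above
apply; the model ignores them.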

In addition there are things that are optional in both specs:

 - Block-list physical buffer lists; this allows memory regions to be
   registered with arbitrary size/alignment blocks instead of just
   page-aligned chunks.  Yet another capability bit if we want to
   expose this.
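
As a rough illustration of how a consumer might test such capability
bits, here is a small sketch.  The flag names and bit positions are
made up for this example and do not correspond to any existing
definition:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability bits; names and values are assumptions. */
enum model_device_cap {
    MODEL_CAP_BMME       = 1u << 0, /* everything in the BMME list */
    MODEL_CAP_BLOCK_LIST = 1u << 1, /* block-list physical buffer lists */
};

/* True only if every bit in 'wanted' is set in 'device_caps'. */
static bool model_has_caps(uint32_t device_caps, uint32_t wanted)
{
    return (device_caps & wanted) == wanted;
}
```

A consumer would then check, say,
model_has_caps(caps, MODEL_CAP_BMME | MODEL_CAP_BLOCK_LIST) once at
setup time and fall back to page-aligned registration if the check
fails.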

There are a few discrepancies between the iWARP and IB verbs that we
need to decide on how we want to handle:

 - In IB-BMME, L_Keys and R_Keys are split up so that there is an
   8-bit "key" that is owned by the consumer.  As far as I know, there
   is no analogous concept defined for iWARP STags; is there any point
   in supporting this IB-only feature (which is optional even in the
   IB spec)?

 - Along similar lines, IB defines two types of memory windows, "type
   1" and "type 2", and in fact type 2 is split into "2A" and "2B" (the
   difference is basically whether the MW is associated with just a
   QP, or with a QP and a PD).  iWARP memory windows are always what
   the IB spec would call type 2B.  All the IB devices that I know of
   with IB-BMME support can handle type 2B memory windows.  Is there
   any point in having our API worry about the distinction between 2A
   and 2B, or should we just decree that we only handle type 2B?  (Does
   anyone who hasn't just been reading specs even understand the
   distinction between type 2A and 2B?)

 - Further, the MW API that we have now, with a separate bind MW verb,
   corresponds to type 1 MWs.  Type 2 MWs are bound by posting a work
   request using the standard "post send" verb.  Given that no IB
   device drivers have implemented the bind MW verb yet, does it make
   sense to deprecate the API for type 1 MWs and say that everyone
   should use type 2[B] MWs only?

 - iWARP supports "RDMA read with invalidate" send work requests,
   while IB has no such operation.  This makes sense because iWARP
   requires the buffer used to receive RDMA read responses to have
   remote write permission, while IB has no such requirement.  I don't
   see a really clean way to handle this except to say that apps have
   to have "if (IB) do_this(); else /* iWARP */ do_that();" code to
   use this in a portable way.

 - Zero-based virtual addresses for memory regions.  This is mandatory
   for iWARP and optional for IB (and is not required even for BMME).
   I think the simplest thing to do is just to have yet another
   capability bit to say whether a device supports ZBVA or not; all
   iWARP devices can set it.
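
For reference, the consumer-owned 8-bit key from the first item in
this list can be pictured as below.  This is a minimal sketch that
assumes the consumer key is the low-order byte of the 32-bit
L_Key/R_Key, with the remaining 24 bits being an index owned by the
device; the helper names are hypothetical:

```c
#include <stdint.h>

/* Assumption: low 8 bits are the consumer key, upper 24 bits the index. */
#define MODEL_KEY_MASK 0xffu

/* Extract the consumer-owned part of a key. */
static uint8_t model_consumer_key(uint32_t lkey)
{
    return (uint8_t)(lkey & MODEL_KEY_MASK);
}

/* Extract the device-owned index part of a key. */
static uint32_t model_key_index(uint32_t lkey)
{
    return lkey >> 8;
}

/* Replace the consumer key while leaving the index untouched. */
static uint32_t model_set_consumer_key(uint32_t lkey, uint8_t key)
{
    return (lkey & ~MODEL_KEY_MASK) | key;
}
```

A consumer could use something like this to change the key byte each
time it re-registers an MR, so that a stale R_Key held by a remote
peer no longer matches.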

Finally, there are proprietary verbs extensions that are each
supported by only a single device at the moment; we have to decide
whether and how to support them.  The tradeoff is between making
useful features available and making the already overly complex
verbs API even harder to fathom, although all of these seem to have
users asking for them:

 - ConnectX has at least XRC, masked atomic operations, and the
   "block loopback" flag for UD QPs.

 - eHCA has "low-latency" QPs.


