[ofa-general] Directions for verbs API extensions

Steve Wise swise at opengridcomputing.com
Mon Apr 7 08:27:28 PDT 2008


Hey Roland. Nice write-up. Comments inline below:

Roland Dreier wrote:
> Here is a little document I wrote trying to summarize all the things
> that we might want to add to the verbs API to support device
> capabilities that aren't exposed yet.  There are a number of issues to
> resolve, and answers to the questions I ask below would help us make
> progress towards actually supporting all this.
>
> There are a number of verbs that are common to the iWARP/RDMA
> consortium verbs and the InfiniBand base memory management extensions
> (IB-BMME).  We would probably add one device capability bit for "BMME"
> (and all iWARP devices could set it) to show support for everything here:
>
>  - Allocate L_Key/STag.  This allocates MR resources without actually
>    registering memory; the MR can then be registered or invalidated as
>    described below.
>
>  - "Fast register" memory through send queue.  This allows a work
>    request to be posted to a send queue to register memory using an
>    L_Key/STag that is in the invalid state.
>
>  - Local invalidate send work requests, which can be used to
>    invalidate an MR or MW.  One subtle point here is that local
>    invalidate operations have very loose ordering, in the sense that
>    they can be executed before earlier requests, but support for
>    fencing local invalidate operations is mandatory in iWARP and only
>    optional in IB.  But is there any IB device that currently exists
>    that supports BMME but doesn't support local invalidate fencing?
>    I really hope we can ignore this possibility.
>
>  - Memory windows associated to a single QP and bound using send work
>    requests posted with the normal post send verb rather than a
>    separate MW verb.  (See below for more)
>
> In addition there are things that are optional in both specs:
>
>  - Block-list physical buffer lists; this allows memory regions to be
>    registered with arbitrary size/alignment blocks instead of just
>    page-aligned chunks.  Yet another capability bit if we want to
>    expose this.
>
> There are a few discrepancies between the iWARP and IB verbs that we
> need to decide on how we want to handle:
>
>  - In IB-BMME, L_Keys and R_Keys are split up so that there is an
>    8-bit "key" that is owned by the consumer.  As far as I know, there
>    is no analogous concept defined for iWARP STags; is there any point
>    in supporting this IB-only feature (which is optional even in the
>    IB spec)?
>
>   
In fact there is an 8-bit key for STags as well. The STag is composed of a 
3-byte index allocated by the driver/hw, and a 1-byte key specified by the 
consumer. None of this is exposed in the Linux RDMA interface at this 
point, and cxgb3 always sets the key to 0xff.
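
As a rough illustration of that layout, here is a minimal sketch of packing 
and unpacking an STag. The helper names are hypothetical (nothing like this 
is exposed by the Linux RDMA interface today), and it assumes the 
consumer-owned key sits in the low-order byte with the index in the upper 
three bytes:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helpers for illustration only.
 * Assumption: 24-bit index in the upper bytes, 8-bit consumer key
 * in the low-order byte.
 */
static inline uint32_t stag_make(uint32_t index, uint8_t key)
{
	assert(index <= 0xffffffu);	/* index must fit in 3 bytes */
	return (index << 8) | key;
}

/* Extract the driver/hw-allocated 3-byte index. */
static inline uint32_t stag_index(uint32_t stag)
{
	return stag >> 8;
}

/* Extract the consumer-specified 1-byte key. */
static inline uint8_t stag_key(uint32_t stag)
{
	return (uint8_t)(stag & 0xffu);
}
```

With these, a driver that always uses key 0xff (as cxgb3 does) would build 
its STags as stag_make(index, 0xff).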

>  - Along similar lines, IB defines two types of memory windows, "type
>    1" and "type 2" and in fact type 2 is split into "2A" and "2B" (the
>    difference is basically whether the MW is associated with just a
>    QP, or with a QP and a PD).  iWARP memory windows are always what
>    the IB spec would call type 2B.  All the IB devices that I know of
>    with IB-BMME support can handle type 2B memory windows.  Is there
>    any point in having our API worry about the distinction between 2A
>    or 2B, or should we just decree that we only handle type 2B?  (Does
>    anyone who hasn't just been reading specs even understand the
>    distinction between type 2A and 2B?)
>
>  - Further, the MW API that we have now, with a separate bind MW verb,
>    corresponds to type 1 MWs.  Type 2 MWs are bound by posting a work
>    request using the standard "post send" verb.  Given that no IB
>    device drivers have implemented the bind MW verb yet, does it make
>    sense to deprecate the API for type 1 MWs and say that everyone
>    should use type 2[B] MWs only?
>
>   
The Chelsio driver supports the iWARP bind_mw SQ WR via the current API. 
In fact, the current API's own documentation implies that this call is an 
SQ operation anyway:
> /**
> * ib_bind_mw - Posts a work request to the send queue of the specified
> * QP, which binds the memory window to the given address range and
> * remote access attributes.

How is the current bind_mw API not valid or correct for iWARP MWs, other 
than being a different call from ib_post_send()?


>  - iWARP supports "RDMA read with invalidate" send work requests,
>    while IB has no such operation.  This makes sense because iWARP
>    requires the buffer used to receive RDMA read responses to have
>    remote write permission, while IB has no such requirement.  I don't
>    see a really clean way to handle this except to say that apps have
>    to have "if (IB) do_this(); else /* iWARP */ do_that();" code to
>    use this in a portable way.
>   

Or a transport-independent app can always use two WRs, an RDMA read 
followed by a fenced local invalidate of the local STag, instead of a 
single read-with-invalidate WR.
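
A minimal sketch of what that two-WR chain could look like. Every type, 
opcode and flag name below is a hypothetical stand-in, since the verbs API 
has no local-invalidate work request yet; the point is only the ordering: 
the invalidate carries a fence so it cannot execute before the preceding 
read completes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes and flags, not the real verbs definitions. */
enum sketch_opcode { SKETCH_WR_RDMA_READ, SKETCH_WR_LOCAL_INV };
enum sketch_flags  { SKETCH_SEND_FENCE = 1 << 0, SKETCH_SEND_SIGNALED = 1 << 1 };

/* Hypothetical, stripped-down send work request. */
struct sketch_wr {
	struct sketch_wr *next;
	enum sketch_opcode opcode;
	unsigned int flags;
	uint32_t local_stag;	/* read sink STag, or the STag to invalidate */
};

/*
 * Chain an RDMA read to a fenced local invalidate of the sink STag.
 * Returns the head of the chain, ready to hand to a post-send call.
 */
static struct sketch_wr *build_read_then_inv(struct sketch_wr *read_wr,
					     struct sketch_wr *inv_wr,
					     uint32_t sink_stag)
{
	assert(read_wr != NULL && inv_wr != NULL);

	read_wr->next = inv_wr;
	read_wr->opcode = SKETCH_WR_RDMA_READ;
	read_wr->flags = 0;
	read_wr->local_stag = sink_stag;

	inv_wr->next = NULL;
	inv_wr->opcode = SKETCH_WR_LOCAL_INV;
	/* The fence holds the invalidate until the prior read has completed. */
	inv_wr->flags = SKETCH_SEND_FENCE | SKETCH_SEND_SIGNALED;
	inv_wr->local_stag = sink_stag;

	return read_wr;
}
```

The same chain works on both transports, at the cost of one extra WR per 
read compared to iWARP's read-with-invalidate.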

>  - Zero-based virtual addresses for memory regions.  This is mandatory
>    for iWARP and optional for IB (and is not required even for BMME).
>    I think the simplest thing to do is just to have yet another
>    capability bit to say whether a device supports ZBVA or not; all
>    iWARP devices can set it.
>
>   
Currently, nobody is using this or the block-list feature. I don't 
think we should bother supporting them unless someone has an app in mind 
that will use them.

> Finally, there are proprietary verbs extensions that are only
> supported by a single device at the moment, which we have to decide if
> and how to support.  It is a tradeoff between making useful features
> available versus making the already overly complex verbs API even more
> impossible to fathom, although it seems all of these have users asking
> for them:
>
>  - ConnectX has XRC, masked atomic operations, and the "block
>    loopback" flag for UD QPs at least.
>
>  - eHCA has "low-latency" QPs.