[ofa-general] Directions for verbs API extensions
Steve Wise
swise at opengridcomputing.com
Mon Apr 7 08:27:28 PDT 2008
Hey Roland. Nice write-up. Comments inline below:
Roland Dreier wrote:
> Here is a little document I wrote trying to summarize all the things
> that we might want to add to the verbs API to support device
> capabilities that aren't exposed yet. There are a number of issues to
> resolve, and answers to the questions I ask below would help us make
> progress towards actually supporting all this.
>
> There are a number of verbs that are common to the iWARP/RDMA
> consortium verbs and the InfiniBand base memory management extensions
> (IB-BMME). We would probably add one device capability bit for "BMME"
> (and all iWARP devices could set it) to show support for everything here:
>
> - Allocate L_Key/STag. This allocates MR resources without actually
> registering memory; the MR can then be registered or invalidated as
> described below.
>
> - "Fast register" memory through send queue. This allows a work
> request to be posted to a send queue to register memory using an
> L_Key/STag that is in the invalid state.
>
> - Local invalidate send work requests, which can be used to
> invalidate an MR or MW. One subtle point here is that local
> invalidate operations have very loose ordering, in the sense that
> they can be executed before earlier requests, but support for
> fencing local invalidate operations is mandatory in iWARP and only
> optional in IB. But is there any IB device that currently exists
> that supports BMME but doesn't support local invalidate fencing?
> I really hope we can ignore this possibility.
>
> - Memory windows associated to a single QP and bound using send work
> requests posted with the normal post send verb rather than a
> separate MW verb. (See below for more)
>
> In addition there are things that are optional in both specs:
>
> - Block-list physical buffer lists; this allows memory regions to be
> registered with arbitrary size/alignment blocks instead of just
> page-aligned chunks. Yet another capability bit if we want to
> expose this.
>
> There are a few discrepancies between the iWARP and IB verbs that we
> need to decide on how we want to handle:
>
> - In IB-BMME, L_Keys and R_Keys are split up so that there is an
> 8-bit "key" that is owned by the consumer. As far as I know, there
> is no analogous concept defined for iWARP STags; is there any point
> in supporting this IB-only feature (which is optional even in the
> IB spec)?
>
>
In fact there is an 8-bit key for STags as well. The STag is composed of a
3-byte index allocated by the driver/hw, and a 1-byte key specified by the
consumer. None of this is exposed in the Linux RDMA interface at this
point, and cxgb3 always sets the key to 0xff.
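To make the layout concrete, here is a rough sketch of how such an STag
could be packed and unpacked. It assumes the consumer key sits in the
low-order byte with the index above it. The helper names (stag_make,
stag_index, stag_key) are made up for illustration and are not part of
any existing verbs API:

```c
#include <stdint.h>

/* Hypothetical helpers, not part of the Linux RDMA interface.
 * Assumed layout: bits 31..8 = index (driver/hw), bits 7..0 = key (consumer). */
#define STAG_KEY_BITS   8u
#define STAG_KEY_MASK   0xffu
#define STAG_INDEX_MASK 0xffffffu

/* Combine a 24-bit index and an 8-bit consumer key into a 32-bit STag. */
static inline uint32_t stag_make(uint32_t index, uint8_t key)
{
    return ((index & STAG_INDEX_MASK) << STAG_KEY_BITS) | key;
}

/* Extract the 24-bit index owned by the driver/hw. */
static inline uint32_t stag_index(uint32_t stag)
{
    return stag >> STAG_KEY_BITS;
}

/* Extract the 8-bit key owned by the consumer. */
static inline uint8_t stag_key(uint32_t stag)
{
    return (uint8_t)(stag & STAG_KEY_MASK);
}
```

With this layout, an index of 0x123456 combined with the 0xff key that
cxgb3 uses today would give the STag 0x123456ff.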
> - Along similar lines, IB defines two types of memory windows, "type
> 1" and "type 2" and in fact type 2 is split into "2A" and "2B" (the
> difference is basically whether the MW is associated with just a
> QP, or with a QP and a PD). iWARP memory windows are always what
> the IB spec would call type 2B. All the IB devices that I know of
> with IB-BMME support can handle type 2B memory windows. Is there
> any point in having our API worry about the distinction between 2A
> or 2B, or should we just decree that we only handle type 2B? (Does
> anyone who hasn't just been reading specs even understand the
> distinction between type 2A and 2B?)
>
> - Further, the MW API that we have now, with a separate bind MW verb,
> corresponds to type 1 MWs. Type 2 MWs are bound by posting a work
> request using the standard "post send" verb. Given that no IB
> device drivers have implemented the bind MW verb yet, does it make
> sense to deprecate the API for type 1 MWs and say that everyone
> should use type 2[B] MWs only?
>
>
The Chelsio driver supports the iWARP bind_mw SQ WR via the current API.
In fact the current API implies that this call is actually an SQ
operation anyway:
> /**
> * ib_bind_mw - Posts a work request to the send queue of the specified
> * QP, which binds the memory window to the given address range and
> * remote access attributes.
How is the current bind_mw API not valid or correct for iWARP MWs, other
than being a different call from ib_post_send()?
> - iWARP supports "RDMA read with invalidate" send work requests,
> while IB has no such operation. This makes sense because iWARP
> requires the buffer used to receive RDMA read responses to have
> remote write permission, while IB has no such requirement. I don't
> see a really clean way to handle this except to say that apps have
> to have "if (IB) do_this(); else /* iWARP */ do_that();" code to
> use this in a portable way.
>
Or a transport-independent app can always use two WRs, an RDMA read
followed by a fenced local invalidate of the STag, instead of a single
read-with-invalidate.
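Roughly, the two-WR variant might be chained like the sketch below. The
struct, opcodes and flags here are hypothetical stand-ins (none of this
exists in the current verbs API); the only point is that the local
invalidate carries the fence flag, so it cannot execute until the
preceding RDMA read has completed:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes and flags, for illustration only. */
enum wr_opcode { WR_RDMA_READ, WR_LOCAL_INV };
enum wr_flags  { WR_FENCE = 1u << 0, WR_SIGNALED = 1u << 1 };

/* Hypothetical, stripped-down send work request. */
struct send_wr {
    struct send_wr *next;   /* next WR in the posted chain */
    enum wr_opcode  opcode;
    unsigned int    flags;
    uint32_t        stag;   /* read: local sink STag; inv: STag to invalidate */
};

/* Fill in a two-entry chain: RDMA read, then a fenced local invalidate
 * of the same sink STag. Returns the head of the chain to post. */
static struct send_wr *build_read_then_inv(struct send_wr wr[2],
                                           uint32_t sink_stag)
{
    wr[0].next   = &wr[1];
    wr[0].opcode = WR_RDMA_READ;
    wr[0].flags  = 0;
    wr[0].stag   = sink_stag;

    wr[1].next   = NULL;
    wr[1].opcode = WR_LOCAL_INV;
    wr[1].flags  = WR_FENCE | WR_SIGNALED; /* wait for the read first */
    wr[1].stag   = sink_stag;

    return &wr[0];
}
```

The same chain would work on both transports, at the cost of one extra
WR compared to read-with-invalidate on iWARP.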
> - Zero-based virtual addresses for memory regions. This is mandatory
> for iWARP and optional for IB (and is not required even for BMME).
> I think the simplest thing to do is just to have yet another
> capability bit to say whether a device supports ZBVA or not; all
> iWARP devices can set it.
>
>
Currently, nobody is using this or the block-mode feature. I don't
think we should bother supporting them unless someone has an app in mind
that will use them.
> Finally, there are proprietary verbs extensions that are only
> supported by a single device at the moment, which we have to decide if
> and how to support. It is a tradeoff between making useful features
> available versus making the already overly complex verbs API even more
> impossible to fathom, although it seems all of these have users asking
> for them:
>
> - ConnectX has XRC, masked atomic operations, and the "block
> loopback" flag for UD QPs at least.
>
> - eHCA has "low-latency" QPs.