[openib-general] RDMA connection and address translation API

Guy German guyg at voltaire.com
Wed Aug 24 04:41:58 PDT 2005


Hi,

- Here is a header file for a proposed CM abstraction API.
- This is just a preliminary suggestion, for review.
- All comments are welcome.
- Please read the notes in the header remarks.
- I am attaching the file, and will also send it to the list later in a
separate message.
- I think that the ib_ prefix should be changed to rdma_, but that
should be done for the rest of the verbs as well, if we are claiming
that the ib verbs abstract iwarp.
- I think that the main difference between the two proposals is the
question of whether or not to expose the consumer to the address
resolution. I believe this suggestion (of covering it in the cma) is
simpler, because it saves the consumer unnecessary upcall handling.
In any case, I don't believe this is clear cut, and I would like to hear
other opinions from people on the list.
- Also, please see my inline answers to the quoted mail below.


Thanks,
Guy.

> We already discussed the problem with having the listen callback pass
> the consumer a remote source address -- doing this requires the
> connection handling module to do an ATS reverse lookup in the IB case,
> which the consumer might not want.  I think there's agreement that the
> correct thing here is for the listen callback to pass a transport
> address to the consumer and provide a function that the consumer can
> call to perform an ATS reverse lookup if desired.  This isn't a major
> problem and can be dealt with.

I agree. This is corrected in the current suggestion.

> However, there's another problem with trying to lump address
> translation and connection into a single "connect" call, and this
> problem looks fundamental and fatal to me.  The connect call takes a
> QP pointer, but to create a QP the consumer needs to know which local
> device to use.  However, the consumer doesn't know which device to use
> until the destination address has been resolved to a route, including
> a local interface.

The proposal, also presented (I believe) at the OpenIB workshop,
includes a function called ib_cma_get_device, which retrieves the device
(for QP creation purposes) according to the destination address and the
local routing table. This is done synchronously, and it is implemented
today in the at module. If link-local IPv6 addresses are used, I think
this function isn't even necessary (if I understand it correctly, you
already need to know which device to go out from).
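
To illustrate what a synchronous lookup of this kind could look like,
here is a minimal user-space sketch. It is not the at module's code: the
route table, the demo_ names and the device strings are all made up, and
a longest-prefix match over IPv4 is only my assumption about how the
selection might work.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical route entry; not an actual kernel structure. */
struct demo_route {
	uint32_t prefix;	/* network prefix, host byte order */
	uint32_t mask;		/* contiguous netmask, host byte order */
	const char *dev;	/* name of the local RDMA device */
};

/*
 * Return the device of the most specific route that matches dst,
 * or NULL if no route matches. Runs synchronously, no callback.
 */
static const char *demo_get_device(const struct demo_route *tbl, size_t n,
				   uint32_t dst)
{
	const char *best = NULL;
	uint32_t best_mask = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if ((dst & tbl[i].mask) != tbl[i].prefix)
			continue;
		/* For contiguous masks, a larger value means a longer prefix. */
		if (best == NULL || tbl[i].mask > best_mask) {
			best = tbl[i].dev;
			best_mask = tbl[i].mask;
		}
	}
	return best;
}
```

The consumer would call this once before creating the QP and would not
need to wait for anything.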

> As far as I can tell, kDAPL punts on this and simply requires the
> consumer to handle the route lookup itself before calling
> dat_ep_connect().  It seems that current kDAPL consumers similarly
> punt on this issue: the iSER initiator and the NFS-RDMA client both
> just use a single device which is statically discovered at init time.
> 
> It seems that the kDAPL connection model has a serious flaw, in that
> it pushes the complexity of route lookup into the consumer.  Further,
> we have strong evidence that this routing code is hard to write and
> that consumers will just ignore this complexity and hard-code
> solutions that don't work under all configurations.
> With this in mind, I believe that the connection API needs to be
> something more like the following:
> 
>     rdma_resolve_address():
>         inputs: dest IP address, qos, npaths,
>             done callback, opaque context
>         done callback params: status, local RDMA device,
>             RDMA transport address, context
> 
>         This function starts the process of resolving an IP address to
>         an RDMA device and address.  When the resolution is complete,
>         the callback is called with a status.  If the status is
>         "success" then the callback also gets the device pointer and
>         transport address (as well as the original context that the
>         consumer passed in).

In the address resolution you have two upcalls (from IP to GID, and from
GID to path). So, if you are already covering one upcall in the cma, why
not cover both?
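
As a rough sketch of what I mean by covering both: the two steps are
chained inside the resolver and the consumer gets a single completion.
Everything here is hypothetical (the demo_ names, the mock GID and path
values), and the steps are synchronous mocks, while the real ones would
be asynchronous.

```c
#include <stddef.h>
#include <stdint.h>

/* All names below are hypothetical; none of them come from ib_cma.h. */
struct demo_path {
	uint16_t dlid;
	uint8_t sl;
};

typedef void (*demo_done_fn)(int status, const struct demo_path *path,
			     void *ctx);

/* Step 1 (mock): IP -> GID. Fails for address 0. */
static int demo_ip_to_gid(uint32_t ip, uint64_t *gid)
{
	if (ip == 0)
		return -1;
	*gid = 0xfe80000000000000ULL | ip;
	return 0;
}

/* Step 2 (mock): GID -> path. */
static int demo_gid_to_path(uint64_t gid, struct demo_path *path)
{
	path->dlid = (uint16_t)(gid & 0xffff);
	path->sl = 0;
	return 0;
}

/* Runs both steps and reports to the consumer exactly once. */
static void demo_resolve(uint32_t ip, demo_done_fn done, void *ctx)
{
	uint64_t gid;
	struct demo_path path;
	int status = demo_ip_to_gid(ip, &gid);

	if (status == 0)
		status = demo_gid_to_path(gid, &path);
	done(status, status == 0 ? &path : NULL, ctx);
}

/* Example consumer callback: records what it was told. */
struct demo_result {
	int calls;
	int status;
	int dlid;
};

static void demo_record(int status, const struct demo_path *path, void *ctx)
{
	struct demo_result *r = ctx;

	r->calls++;
	r->status = status;
	r->dlid = path ? path->dlid : -1;
}
```

The consumer sees one callback whether the failure happened in the first
step or the second.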

>         The "RDMA transport address" type is a union containing
>         transport-dependent data.  In the IB case, it's all of the
>         SGID, DGID, SLID, DLID, SL etc. that we know and love.  In the
>         iWARP case, it's the source IP, destination IP and QOS.
> 
>         npaths can be either 1 or 2 in the IB case; if it's 2, then
>         the resolver will try to find a primary and alternate path for
>         APM.  In the iWARP case, I guess npaths will always be 1, and
>         I guess anyone who wants to use iWARP over multihomed SCTP
>         will probably have to use some lower-level API.
> 
>         By the way, we may also have to have the option of passing in
>         a local netdev so that we can handle link-local IPv6
>         addresses.  There may be other cases I haven't thought of yet.
>         I just hope we can avoid going all the way to the horror of
>         the getaddrinfo() API.
> 
>         I also hope we can agree to use IPoIB ARP to resolve the
>         address in the IB case; having a flag or some other hack in
>         the API to expose the option of ATS seems unacceptably ugly.
> 
>     rdma_connect():
>         inputs: local QP, RDMA transport address, destination service,
>             private data, timeout, event callback, opaque context
> 
>         This function takes the resolved address and actually connects.
> 
>         I'm not sure how we want to abstract the IB service vs. iWARP
>         TCP port number difference.  I guess it's OK to have iWARP
>         consumers stick their (16-bit) port number in a 64-bit
>         parameter, even if it's not the prettiest API.
> 
> To head off the knee-jerk objection: this API does NOT require any
> transport-specific code in consumers (unless a particular consumer
> WANTS to look inside the RDMA transport address).  Code to connect
> would be as simple as:
> 
>     rdma_resolve_address(...);
>     /* wait for resolution */
>     ib_create_qp(...) /* use device pointer we got from rdma_resolve_address() */
>     rdma_connect(...); /* pass transport address we got from rdma_resolve_address() */
>     /* wait for connection to finish... */
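
For reference, the transport address union and the 64-bit service
parameter described in the quoted text might be laid out roughly as
follows. This is only a sketch: the demo_ names, field widths and the
IPv4-only iWARP addresses are my assumptions, not taken from any header.

```c
#include <stdint.h>

/* Hypothetical names and layout, for illustration only. */
enum demo_transport {
	DEMO_TRANSPORT_IB,
	DEMO_TRANSPORT_IWARP
};

struct demo_ib_addr {
	uint8_t sgid[16];
	uint8_t dgid[16];
	uint16_t slid;
	uint16_t dlid;
	uint8_t sl;
};

struct demo_iwarp_addr {
	uint32_t src_ip;	/* IPv4 only in this sketch */
	uint32_t dst_ip;
	uint8_t qos;
};

/* The consumer passes this around without looking inside it. */
struct demo_transport_addr {
	enum demo_transport type;
	union {
		struct demo_ib_addr ib;
		struct demo_iwarp_addr iwarp;
	} u;
};

/* Widen a 16-bit TCP port into the 64-bit service parameter. */
static uint64_t demo_service_from_port(uint16_t port)
{
	return (uint64_t)port;
}
```

A consumer that does not care about the transport only copies the
structure from the resolution callback into the connect call.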


Wouldn't it be simpler (for the consumer) to do:

	resolve_device_by_destip();
	/* don't wait */
	ib_create_qp(...) /* use device pointer we got */
	rdma_connect(dest_ip); /* the cma does the address resolution for IB */
	/* wait for at + connection to finish... */

I think this flow is also more "iwarp friendly": it saves iWARP consumers
the asynchronous rdma_resolve_address wait.

> 
> The listen side is even simpler:
> 
>     rdma_listen():
>         inputs: local service, event callback, consumer context
> 
>         Wait for connection requests and pass events to the consumer's
>         callback.  I'm not sure if/how we want to support binding to
>         a particular IP address.  The current IB CM in Linux doesn't
>         support binding a listen to a single device or port, and even
>         if it did it's not clear how to handle binding to one IP
>         address when a port has more than one IP.
>         I guess the event callback would receive a device pointer and
>         the same RDMA transport address union I talked about above
>         when discussing address resolution.
> 
>         It would be possible to have another function like
>         rdma_getpeername() that takes the transport address and
>         returns a source IP address.  In the IB case this would do an
>         ATS reverse lookup.  However, I hate this idea.  iSER already
>         uses the CM private data to pass the source IP in the IB case,
>         and I would much rather fix NFS/RDMA to do the same thing (so
>         we can just kill ATS as an address resolution method).

>  - R.
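
To make the private data alternative concrete, here is a small sketch of
carrying the source IP in the CM private data and reading it back on the
listen side. The 4-byte big-endian layout and the demo_ names are
assumptions of mine for illustration; this is not the format iSER
actually uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: private data holds only a big-endian IPv4 address. */
#define DEMO_PRIV_LEN 4

/* Active side: returns the number of bytes written, or 0 if len is too small. */
static size_t demo_pack_src_ip(uint32_t src_ip, uint8_t *priv, size_t len)
{
	if (len < DEMO_PRIV_LEN)
		return 0;
	priv[0] = (uint8_t)(src_ip >> 24);
	priv[1] = (uint8_t)(src_ip >> 16);
	priv[2] = (uint8_t)(src_ip >> 8);
	priv[3] = (uint8_t)src_ip;
	return DEMO_PRIV_LEN;
}

/* Passive side: returns 0 on success, -1 if the private data is too short. */
static int demo_unpack_src_ip(const uint8_t *priv, size_t len, uint32_t *src_ip)
{
	if (len < DEMO_PRIV_LEN)
		return -1;
	*src_ip = ((uint32_t)priv[0] << 24) | ((uint32_t)priv[1] << 16) |
		  ((uint32_t)priv[2] << 8) | (uint32_t)priv[3];
	return 0;
}
```

With something like this the listen callback gets the peer's IP without
any reverse lookup.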



-------------- next part --------------
A non-text attachment was scrubbed...
Name: ib_cma.h
Type: text/x-chdr
Size: 7360 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20050824/c98aa1c2/attachment.h>

