[openib-general] RDMA connection and address translation API

Wed Aug 24 07:32:21 PDT 2005

Roland:

Steve and I came to the same conclusion on the airplane ride back to
Austin. Whereas plain old TCP/IP selects a device at the bottom of the
stack, RDMA transports must select the device at the top because
pre-connect resources must be allocated and these resources are
associated with a particular device.

I think you've absolutely nailed the active side (by the way, I think
the ib_at_route_by_ip service already performs the necessary routing
function). The listen side, however, I think needs a little tweaking. It
would be beneficial if the client can specify either an IP address and
port to listen on (effectively selecting a particular device), or a wild
card (all RDMA devices). An NFS server is an example of the latter. This
is trivial to do by providing an address to the listen call where a '0'
represents a wild card.
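
Here is a minimal sketch of the matching rule I have in mind. Nothing
in it is an existing interface: RDMA_ADDR_ANY, struct listen_entry and
listen_matches() are names I made up for illustration, and I am
assuming an IPv4 address held in a plain 32-bit integer.

```c
#include <stdint.h>

/* Hypothetical wildcard value: a local address of 0 means "all RDMA devices". */
#define RDMA_ADDR_ANY 0u

/* Hypothetical record of one listen request. */
struct listen_entry {
    uint32_t local_ip;  /* IPv4 address to listen on, or RDMA_ADDR_ANY */
    uint64_t service;   /* IB service ID or iWARP TCP port */
};

/*
 * Return 1 if a connection request addressed to (dst_ip, service)
 * should be delivered to this listener, 0 otherwise.
 */
static int listen_matches(const struct listen_entry *l,
                          uint32_t dst_ip, uint64_t service)
{
    if (l->service != service)
        return 0;
    /* A wildcard listener accepts requests arriving on any device. */
    return l->local_ip == RDMA_ADDR_ANY || l->local_ip == dst_ip;
}
```

An NFS server would register with RDMA_ADDR_ANY; a client that wants
one particular device would pass that device's IP address instead.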

> -----Original Message-----
> From: openib-general-bounces at openib.org 
> [mailto:openib-general-bounces at openib.org] On Behalf Of Roland Dreier
> Sent: Wednesday, August 24, 2005 12:07 AM
> To: openib-general at openib.org
> Subject: [openib-general] RDMA connection and address translation API
> 
> At the OpenIB workshop on Monday, we had some discussion 
> about a high-level transport-neutral API for connection 
> handling.  After giving the topic some more thought, I've 
> come to the conclusion that neither the kDAPL API nor the new 
> API that was presented are usable.
> In this email, I'll try to detail my reasoning and sketch 
> what I believe is the correct API.
> 
> The new API that we looked at was essentially the following 
> (I'm recreating this from memory, so I apologize if I 
> misrepresent it):
> 
>     listen(local_ip_address, service_id, listen_callback)
>     connect(local_qp, remote_ip_address, qos, service_id,
>             private_data, connect_callback)
> 
> We already discussed the problem with having the listen 
> callback pass the consumer a remote source address -- doing 
> this requires the connection handling module to do an ATS 
> reverse lookup in the IB case, which the consumer might not 
> want.  I think there's agreement that the correct thing here 
> is for the listen callback to pass a transport address to the 
> consumer and provide a function that the consumer can call to 
> perform an ATS reverse lookup if desired.  This isn't a major 
> problem and can be dealt with.
> 
> However, there's another problem with trying to lump address 
> translation and connection into a single "connect" call, and 
> this problem looks fundamental and fatal to me.  The connect 
> call takes a QP pointer, but to create a QP the consumer 
> needs to know which local device to use.  However, the 
> consumer doesn't know which device to use until the 
> destination address has been resolved to a route, including a 
> local interface.
> 
> As far as I can tell, kDAPL punts on this and simply requires 
> the consumer to handle the route lookup itself before calling 
> dat_ep_connect().  It seems that current kDAPL consumers 
> similarly punt on this issue: the iSER initiator and the 
> NFS-RDMA client both just use a single device which is 
> statically discovered at init time.
> 
> It seems that the kDAPL connection model has a serious flaw, 
> in that it pushes the complexity of route lookup into the 
> consumer.  Further, we have strong evidence that this routing 
> code is hard to write and that consumers will just ignore 
> this complexity and hard-code solutions that don't work under 
> all configurations.
> 
> With this in mind, I believe that the connection API needs to 
> be something more like the following:
> 
>     rdma_resolve_address():
>         inputs: dest IP address, qos, npaths,
>             done callback, opaque context
>         done callback params: status, local RDMA device,
>             RDMA transport address, context
> 
>         This function starts the process of resolving an IP address to
>         an RDMA device and address.  When the resolution is complete,
>         the callback is called with a status.  If the status is
>         "success" then the callback also gets the device pointer and
>         transport address (as well as the original context that the
>         consumer passed in).
> 
>         The "RDMA transport address" type is a union containing
>         transport-dependent data.  In the IB case, it's all of the
>         SGID, DGID, SLID, DLID, SL etc. that we know and love.  In the
>         iWARP case, it's the source IP, destination IP and QOS.
> 
>         npaths can be either 1 or 2 in the IB case; if it's 2, then
>         the resolver will try to find a primary and alternate path for
>         APM.  In the iWARP case, I guess npaths will always be 1, and
>         I guess anyone who wants to use iWARP over multihomed SCTP
>         will probably have to use some lower-level API.
> 
>         By the way, we may also have to have the option of passing in
>         a local netdev so that we can handle link-local IPv6
>         addresses.  There may be other cases I haven't thought of yet.
>         I just hope we can avoid going all the way to the horror of
>         the getaddrinfo() API.
> 
>         I also hope we can agree to use IPoIB ARP to resolve the
>         address in the IB case; having a flag or some other hack in
>         the API to expose the option of ATS seems unacceptably ugly.
> 
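
To make the types concrete, here is a rough sketch of how the
transport address union and the done callback might be declared. All
type, field and function names are my own guesses for illustration,
not a proposed or existing header; the field widths are assumptions
too (the iWARP side assumes IPv4). The helper at the end only encodes
the npaths rule described above.

```c
#include <stdint.h>

/* Hypothetical transport tag for the union below. */
enum rdma_transport { RDMA_TRANSPORT_IB, RDMA_TRANSPORT_IWARP };

/* IB case: the path fields named above (SGID, DGID, SLID, DLID, SL). */
struct rdma_ib_addr {
    uint8_t  sgid[16];
    uint8_t  dgid[16];
    uint16_t slid;
    uint16_t dlid;
    uint8_t  sl;
};

/* iWARP case: source IP, destination IP and QOS (IPv4 assumed here). */
struct rdma_iwarp_addr {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint8_t  qos;
};

struct rdma_transport_addr {
    enum rdma_transport type;
    union {
        struct rdma_ib_addr    ib;
        struct rdma_iwarp_addr iwarp;
    } u;
};

/* Done callback: status, local RDMA device, transport address, context. */
typedef void (*rdma_resolve_done)(int status, void *device,
                                  const struct rdma_transport_addr *addr,
                                  void *context);

/* npaths may be 1 or 2 for IB (primary plus APM alternate), 1 for iWARP. */
static int rdma_npaths_valid(enum rdma_transport type, int npaths)
{
    if (type == RDMA_TRANSPORT_IB)
        return npaths == 1 || npaths == 2;
    return npaths == 1;
}
```

A consumer that does not care about the transport never has to look
inside the union; it just hands the address back to the connect call.
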
>     rdma_connect():
>         inputs: local QP, RDMA transport address, destination service,
>             private data, timeout, event callback, opaque context
> 
>         This function takes the resolved address and actually
>         connects.
> 
>         I'm not sure how we want to abstract the IB service vs. iWARP
>         TCP port number difference.  I guess it's OK to have iWARP
>         consumers stick their (16-bit) port number in a 64-bit
>         parameter, even if it's not the prettiest API.
> 
> To head off the knee-jerk objection: this API does NOT 
> require any transport-specific code in consumers (unless a 
> particular consumer WANTS to look inside the RDMA transport 
> address).  Code to connect would be as simple as:
> 
>     rdma_resolve_address(...);
>     /* wait for resolution */
>     ib_create_qp(...); /* use device pointer we got from
>                           rdma_resolve_address() */
>     rdma_connect(...); /* pass transport address we got from
>                           rdma_resolve_address() */
>     /* wait for connection to finish... */
> 
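
The ordering in that sequence is the whole point, so here is a toy,
user-space sketch of it. Everything prefixed demo_ is a stand-in I
invented: the resolver completes synchronously (a real one would call
back later), "creating the QP" is reduced to recording the device, and
the service number 2049 is just an example value.

```c
#include <stddef.h>
#include <stdint.h>

struct demo_device { const char *name; };
struct demo_addr   { uint32_t dst_ip; };
struct demo_qp     { struct demo_device *dev; };

/* Per-connection state owned by the consumer (the "opaque context"). */
struct demo_conn {
    struct demo_device *dev;
    struct demo_addr    addr;
    struct demo_qp      qp;
    int                 connected;
};

typedef void (*demo_resolve_done)(int status, struct demo_device *dev,
                                  const struct demo_addr *addr,
                                  void *context);

static struct demo_device demo_dev0 = { "dev0" };

/* Stub resolver: always picks demo_dev0 and completes immediately. */
static int demo_resolve_address(uint32_t dst_ip, demo_resolve_done done,
                                void *context)
{
    struct demo_addr addr = { dst_ip };
    done(0, &demo_dev0, &addr, context);
    return 0;
}

/* Stub connect: succeeds if the QP has a device and the address is set. */
static int demo_connect(struct demo_qp *qp, const struct demo_addr *addr,
                        uint64_t service)
{
    (void)service;
    return (qp->dev != NULL && addr->dst_ip != 0) ? 0 : -1;
}

/* Resolution finished: only now do we know which device the QP needs. */
static void demo_on_resolved(int status, struct demo_device *dev,
                             const struct demo_addr *addr, void *context)
{
    struct demo_conn *c = context;

    if (status != 0)
        return;
    c->dev = dev;
    c->addr = *addr;
    c->qp.dev = dev;  /* stands in for ib_create_qp() on that device */
    c->connected = (demo_connect(&c->qp, &c->addr, 2049) == 0);
}

static int demo_connect_to(uint32_t dst_ip, struct demo_conn *c)
{
    c->dev = NULL;
    c->qp.dev = NULL;
    c->connected = 0;
    demo_resolve_address(dst_ip, demo_on_resolved, c);
    return c->connected ? 0 : -1;
}
```

Note that the consumer never names a device up front; the QP is
created on whatever device the resolver hands back.
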
> The listen side is even simpler:
> 
>     rdma_listen():
>         inputs: local service, event callback, consumer context
> 
>         Wait for connection requests and pass events to the consumer's
>         callback.  I'm not sure if/how we want to support binding to
>         a particular IP address.  The current IB CM in Linux doesn't
>         support binding a listen to a single device or port, and even
>         if it did it's not clear how to handle binding to one IP
>         address when a port has more than one IP.
> 
>         I guess the event callback would receive a device pointer and
>         the same RDMA transport address union I talked about above
>         when discussing address resolution.
> 
>         It would be possible to have another function like
>         rdma_getpeername() that takes the transport address and
>         returns a source IP address.  In the IB case this would do an
>         ATS reverse lookup.  However, I hate this idea.  iSER already
>         uses the CM private data to pass the source IP in the IB case,
>         and I would much rather fix NFS/RDMA to do the same thing (so
>         we can just kill ATS as an address resolution method).
> 
>  - R.
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general