[openib-general] RDMA connection and address translation API
Tom Tucker
tom at ammasso.com
Wed Aug 24 07:32:21 PDT 2005
Roland:
Steve and I came to the same conclusion on the airplane ride back to
Austin. Whereas plain old TCP/IP selects a device at the bottom of the
stack, RDMA transports must select the device at the top because
pre-connect resources must be allocated and these resources are
associated with a particular device.
I think you've absolutely nailed the active side (by the way, I think
the ib_at_route_by_ip service already performs the necessary routing
function). The listen side, however, I think needs a little tweaking. It
would be beneficial if the client can specify either an IP address and
port to listen on (effectively selecting a particular device), or a wild
card (all RDMA devices). An NFS server is an example of the latter. This
is trivial to do by providing an address to the listen call where a '0'
represents a wild card.
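
As a rough illustration of the wildcard idea, here is a minimal sketch of how a listener's address could be matched against an incoming request. The struct, the constant and the function name are all hypothetical (none of them exist in the current code), and only IPv4 is assumed:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: an address of 0 acts as the wildcard (all RDMA devices). */
#define SKETCH_LISTEN_ADDR_ANY 0u

/* Hypothetical listen endpoint: IPv4 address (host byte order) plus port. */
struct sketch_listen_endpoint {
    uint32_t ip;    /* SKETCH_LISTEN_ADDR_ANY means "listen everywhere" */
    uint16_t port;
};

/*
 * Return true if a connection request that arrived for (local_ip, port)
 * should be delivered to this listener.
 */
static bool sketch_listen_matches(const struct sketch_listen_endpoint *l,
                                  uint32_t local_ip, uint16_t port)
{
    assert(l != NULL);
    if (l->port != port)
        return false;
    return l->ip == SKETCH_LISTEN_ADDR_ANY || l->ip == local_ip;
}
```

A listener bound to a specific address only sees requests for the device that owns that address, while a wildcard listener (the NFS server case) sees requests on every device.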
> -----Original Message-----
> From: openib-general-bounces at openib.org
> [mailto:openib-general-bounces at openib.org] On Behalf Of Roland Dreier
> Sent: Wednesday, August 24, 2005 12:07 AM
> To: openib-general at openib.org
> Subject: [openib-general] RDMA connection and address translation API
>
> At the OpenIB workshop on Monday, we had some discussion
> about a high-level transport-neutral API for connection
> handling. After giving the topic some more thought, I've
> come to the conclusion that neither the kDAPL API nor the new
> API that was presented are usable.
> In this email, I'll try to detail my reasoning and sketch
> what I believe is the correct API.
>
> The new API that we looked at was essentially the following
> (I'm recreating this from memory, so I apologize if I
> misrepresent it):
>
> listen(local_ip_address, service_id, listen_callback)
> connect(local_qp, remote_ip_address, qos, service_id,
> private_data, connect_callback)
>
> We already discussed the problem with having the listen
> callback pass the consumer a remote source address -- doing
> this requires the connection handling module to do an ATS
> reverse lookup in the IB case, which the consumer might not
> want. I think there's agreement that the correct thing here
> is for the listen callback to pass a transport address to the
> consumer and provide a function that the consumer can call to
> perform an ATS reverse lookup if desired. This isn't a major
> problem and can be dealt with.
>
> However, there's another problem with trying to lump address
> translation and connection into a single "connect" call, and
> this problem looks fundamental and fatal to me. The connect
> call takes a QP pointer, but to create a QP the consumer
> needs to know which local device to use. However, the
> consumer doesn't know which device to use until the
> destination address has been resolved to a route, including a
> local interface.
>
> As far as I can tell, kDAPL punts on this and simply requires
> the consumer to handle the route lookup itself before calling
> dat_ep_connect(). It seems that current kDAPL consumers
> similarly punt on this issue: the iSER initiator and the
> NFS-RDMA client both just use a single device which is
> statically discovered at init time.
>
> It seems that the kDAPL connection model has a serious flaw,
> in that it pushes the complexity of route lookup into the
> consumer. Further, we have strong evidence that this routing
> code is hard to write and that consumers will just ignore
> this complexity and hard-code solutions that don't work under
> all configurations.
>
> With this in mind, I believe that the connection API needs to
> be something more like the following:
>
> rdma_resolve_address():
> inputs: dest IP address, qos, npaths,
> done callback, opaque context
> done callback params: status, local RDMA device,
> RDMA transport address, context
>
> This function starts the process of resolving an IP address to
> an RDMA device and address. When the resolution is complete,
> the callback is called with a status. If the status is
> "success" then the callback also gets the device pointer and
> transport address (as well as the original context that the
> consumer passed in).
>
> The "RDMA transport address" type is a union containing
> transport-dependent data. In the IB case, it's all of the
> SGID, DGID, SLID, DLID, SL etc. that we know and love. In the
> iWARP case, it's the source IP, destination IP and QOS.
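
The union described above might look something like the following sketch. Every type and field name here is hypothetical, the "etc." in the IB field list is left out rather than guessed at, and IPv4 is assumed for the iWARP case:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not an existing OpenIB type. */
enum sketch_transport { SKETCH_IB, SKETCH_IWARP };

struct sketch_ib_addr {
    uint8_t  sgid[16];  /* source GID (128 bits) */
    uint8_t  dgid[16];  /* destination GID (128 bits) */
    uint16_t slid;      /* source LID */
    uint16_t dlid;      /* destination LID */
    uint8_t  sl;        /* service level */
};

struct sketch_iwarp_addr {
    uint32_t src_ip;    /* source IPv4 address */
    uint32_t dst_ip;    /* destination IPv4 address */
    uint8_t  qos;       /* representation of QOS is an assumption */
};

struct sketch_transport_addr {
    enum sketch_transport type;  /* selects the valid union member */
    union {
        struct sketch_ib_addr    ib;
        struct sketch_iwarp_addr iwarp;
    } u;
};

/* A consumer that wants to look inside can test the tag first. */
static int sketch_addr_is_ib(const struct sketch_transport_addr *a)
{
    assert(a != NULL);
    return a->type == SKETCH_IB;
}
```

A consumer that does not care about the transport can treat the whole struct as opaque and simply hand it back to the connect call.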
>
> npaths can be either 1 or 2 in the IB case; if it's 2, then
> the resolver will try to find a primary and alternate path for
> APM. In the iWARP case, I guess npaths will always be 1, and
> I guess anyone who wants to use iWARP over multihomed SCTP
> will probably have to use some lower-level API.
>
> By the way, we may also have to have the option of passing in
> a local netdev so that we can handle link-local IPv6
> addresses. There may be other cases I haven't thought of yet.
> I just hope we can avoid going all the way to the horror of
> the getaddrinfo() API.
>
> I also hope we can agree to use IPoIB ARP to resolve the
> address in the IB case; having a flag or some other hack in
> the API to expose the option of ATS seems unacceptably ugly.
>
> rdma_connect():
> inputs: local QP, RDMA transport address, destination service,
> private data, timeout, event callback, opaque context
>
> This function takes the resolved address and actually
> connects.
>
> I'm not sure how we want to abstract the IB service vs. iWARP
> TCP port number difference. I guess it's OK to have iWARP
> consumers stick their (16-bit) port number in a 64-bit
> parameter, even if it's not the prettiest API.
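
One possible way to carry a 16-bit port in a 64-bit service parameter is sketched below. The helper names are hypothetical, and placing the port in the low 16 bits with all upper bits zero is an assumption, not an agreed encoding:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: widen a TCP port into a 64-bit service value. */
static uint64_t sketch_service_from_port(uint16_t port)
{
    return (uint64_t)port;
}

/*
 * Hypothetical helper: recover the port on the iWARP side.
 * Returns false if the value does not fit in 16 bits, which would
 * mean the consumer passed something that is not a port.
 */
static bool sketch_service_to_port(uint64_t service, uint16_t *port)
{
    assert(port != NULL);
    if (service > UINT16_MAX)
        return false;
    *port = (uint16_t)service;
    return true;
}
```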
>
> To head off the knee-jerk objection: this API does NOT
> require any transport-specific code in consumers (unless a
> particular consumer WANTS to look inside the RDMA transport
> address). Code to connect would be as simple as:
>
> rdma_resolve_address(...);
> /* wait for resolution */
> ib_create_qp(...) /* use device pointer we got from
> rdma_resolve_address() */
> rdma_connect(...); /* pass transport address we got from
> rdma_resolve_address() */
> /* wait for connection to finish... */
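
The ordering in the outline above (resolve first, then create the QP on the device the resolver reports, then connect) can be illustrated with a small stand-alone sketch. Everything here is a hypothetical stand-in: the resolver is a stub that calls back immediately with a fixed device, whereas a real one would complete asynchronously, and the QP creation and connect steps are reduced to field assignments:

```c
#include <assert.h>
#include <stddef.h>

/* All names below are hypothetical stand-ins, not the OpenIB API. */
struct sketch_device { const char *name; };
struct sketch_addr   { unsigned int dest_ip; };
struct sketch_qp     { const struct sketch_device *dev; };

/* Done callback: status, local device, transport address, context. */
typedef void (*sketch_resolve_done)(int status,
                                    const struct sketch_device *dev,
                                    const struct sketch_addr *addr,
                                    void *context);

/* Stub resolver: "resolves" every address to one fixed device. */
static void sketch_resolve_address(unsigned int dest_ip,
                                   sketch_resolve_done done, void *context)
{
    static const struct sketch_device dev0 = { "rdma0" };
    struct sketch_addr addr = { dest_ip };

    assert(done != NULL);
    done(0, &dev0, &addr, context);
}

/* Per-connection state owned by the consumer. */
struct sketch_conn {
    struct sketch_qp   qp;
    struct sketch_addr addr;
    int                connected;
};

/* The consumer learns the device only here, so the QP is created here. */
static void sketch_on_resolved(int status, const struct sketch_device *dev,
                               const struct sketch_addr *addr, void *context)
{
    struct sketch_conn *conn = context;

    if (status != 0)
        return;               /* resolution failed; nothing to create */
    conn->qp.dev = dev;       /* stands in for creating the QP on dev */
    conn->addr = *addr;       /* transport address saved for connect */
    conn->connected = 1;      /* stands in for the connect completing */
}
```

Note that the consumer code never inspects the transport address; it only stores it and passes it on.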
>
> The listen side is even simpler:
>
> rdma_listen():
> inputs: local service, event callback, consumer context
>
> Wait for connection requests and pass events to the consumer's
> callback. I'm not sure if/how we want to support binding to
> a particular IP address. The current IB CM in Linux doesn't
> support binding a listen to a single device or port, and even
> if it did it's not clear how to handle binding to one IP
> address when a port has more than one IP.
>
> I guess the event callback would receive a device pointer and
> the same RDMA transport address union I talked about above
> when discussing address resolution.
>
> It would be possible to have another function like
> rdma_getpeername() that takes the transport address and
> returns a source IP address. In the IB case this would do an
> ATS reverse lookup. However, I hate this idea. iSER already
> uses the CM private data to pass the source IP in the IB case,
> and I would much rather fix NFS/RDMA to do the same thing (so
> we can just kill ATS as an address resolution method).
>
> - R.