[openib-general] RDMA connection and address translation API
Steve Wise
swise at ammasso.com
Wed Aug 24 07:59:12 PDT 2005
Roland, this looks good! A few comments below...
> -----Original Message-----
> From: openib-general-bounces at openib.org
> [mailto:openib-general-bounces at openib.org] On Behalf Of Roland Dreier
> Sent: Wednesday, August 24, 2005 12:07 AM
> To: openib-general at openib.org
> Subject: [openib-general] RDMA connection and address translation API
>
> At the OpenIB workshop on Monday, we had some discussion about a
> high-level transport-neutral API for connection handling. After
> giving the topic some more thought, I've come to the conclusion that
> neither the kDAPL API nor the new API that was presented are usable.
> In this email, I'll try to detail my reasoning and sketch what I
> believe is the correct API.
>
> The new API that we looked at was essentially the following (I'm
> recreating this from memory, so I apologize if I misrepresent it):
>
> listen(local_ip_address, service_id, listen_callback)
> connect(local_qp, remote_ip_address, qos, service_id,
>         private_data, connect_callback)
>
> We already discussed the problem with having the listen callback pass
> the consumer a remote source address -- doing this requires the
> connection handling module to do an ATS reverse lookup in the IB case,
> which the consumer might not want. I think there's agreement that the
> correct thing here is for the listen callback to pass a transport
> address to the consumer and provide a function that the consumer can
> call to perform an ATS reverse lookup if desired. This isn't a major
> problem and can be dealt with.
>
> However, there's another problem with trying to lump address
> translation and connection into a single "connect" call, and this
> problem looks fundamental and fatal to me. The connect call takes a
> QP pointer, but to create a QP the consumer needs to know which local
> device to use. However, the consumer doesn't know which device to use
> until the destination address has been resolved to a route, including
> a local interface.
>
> As far as I can tell, kDAPL punts on this and simply requires the
> consumer to handle the route lookup itself before calling
> dat_ep_connect(). It seems that current kDAPL consumers similarly
> punt on this issue: the iSER initiator and the NFS-RDMA client both
> just use a single device which is statically discovered at init time.
>
Yes, DAPL punts on this.
> It seems that the kDAPL connection model has a serious flaw, in that
> it pushes the complexity of route lookup into the consumer. Further,
> we have strong evidence that this routing code is hard to write and
> that consumers will just ignore this complexity and hard-code
> solutions that don't work under all configurations.
>
I agree!
> With this in mind, I believe that the connection API needs to be
> something more like the following:
>
> rdma_resolve_address():
> inputs: dest IP address, qos, npaths,
>         done callback, opaque context
> done callback params: status, local RDMA device,
>         RDMA transport address, context
>
> This function starts the process of resolving an IP address to
> an RDMA device and address. When the resolution is complete,
> the callback is called with a status. If the status is
> "success" then the callback also gets the device pointer and
> transport address (as well as the original context that the
> consumer passed in).
>
> The "RDMA transport address" type is a union containing
> transport-dependent data. In the IB case, it's all of the
> SGID, DGID, SLID, DLID, SL etc. that we know and love. In the
> iWARP case, it's the source IP, destination IP and QOS.
>
> npaths can be either 1 or 2 in the IB case; if it's 2, then
> the resolver will try to find a primary and alternate path for
> APM. In the iWARP case, I guess npaths will always be 1, and
> I guess anyone who wants to use iWARP over multihomed SCTP
> will probably have to use some lower-level API.
>
> By the way, we may also have to have the option of passing in
> a local netdev so that we can handle link-local IPv6
> addresses. There may be other cases I haven't thought of yet.
> I just hope we can avoid going all the way to the horror of
> the getaddrinfo() API.
>
> I also hope we can agree to use IPoIB ARP to resolve the
> address in the IB case; having a flag or some other hack in
> the API to expose the option of ATS seems unacceptably ugly.
>
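To make the "union containing transport-dependent data" concrete, here
is a minimal sketch of what the address type and the done callback
might look like. Every name below (rdma_transport_addr, ib_path_addr,
iwarp_path_addr, rdma_resolve_done_fn, consumer_state) is an assumption
made up for illustration and is not part of the proposal; the iWARP
member assumes IPv4 only to keep the sketch short.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* All names here are hypothetical; the proposal does not define them. */
enum rdma_transport { RDMA_TRANSPORT_IB, RDMA_TRANSPORT_IWARP };

struct ib_path_addr {           /* IB case: GIDs, LIDs and SL */
    uint8_t  sgid[16];
    uint8_t  dgid[16];
    uint16_t slid;
    uint16_t dlid;
    uint8_t  sl;
};

struct iwarp_path_addr {        /* iWARP case; IPv4 only is an assumption */
    uint32_t src_ip;
    uint32_t dst_ip;
    uint8_t  qos;
};

struct rdma_transport_addr {
    enum rdma_transport type;
    union {
        struct ib_path_addr    ib;
        struct iwarp_path_addr iwarp;
    } u;
};

struct rdma_device;             /* opaque to the consumer */

/* Shape of the "done" callback: status, device, address, context. */
typedef void (*rdma_resolve_done_fn)(int status, struct rdma_device *dev,
                                     const struct rdma_transport_addr *addr,
                                     void *context);

/* A transport-neutral consumer just stores what it is handed. */
struct consumer_state {
    int resolved;
    struct rdma_device *dev;
    struct rdma_transport_addr addr;
};

static void consumer_resolve_done(int status, struct rdma_device *dev,
                                  const struct rdma_transport_addr *addr,
                                  void *context)
{
    struct consumer_state *st = context;

    assert(st != NULL);
    if (status != 0) {          /* resolution failed: dev/addr not valid */
        st->resolved = 0;
        return;
    }
    st->dev = dev;
    st->addr = *addr;           /* copied whole, never inspected */
    st->resolved = 1;
}
```

Note that the consumer callback never looks at the type tag or the
union members, which is the property the proposal relies on.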
> rdma_connect():
> inputs: local QP, RDMA transport address, destination service,
>         private data, timeout, event callback, opaque context
>
> This function takes the resolved address and actually
> connects.
>
> I'm not sure how we want to abstract the IB service vs. iWARP
> TCP port number difference. I guess it's OK to have iWARP
> consumers stick their (16-bit) port number in a 64-bit
> parameter, even if it's not the prettiest API.
>
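As an illustration of the "16-bit port number in a 64-bit parameter"
idea, a sketch of the conversion could look like the following. The
type rdma_service_t and both helpers are hypothetical names, and
storing the port in the low 16 bits with the rest zero is only one
possible convention.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 64-bit "destination service" type shared by IB and iWARP. */
typedef uint64_t rdma_service_t;

/* iWARP consumer side: widen a TCP port into the 64-bit parameter. */
static inline rdma_service_t rdma_service_from_port(uint16_t port)
{
    return (rdma_service_t)port;
}

/*
 * iWARP provider side: recover the port.  Returns 0 on success, or -1
 * if the value does not fit in 16 bits and so cannot be a TCP port.
 */
static inline int rdma_service_to_port(rdma_service_t svc, uint16_t *port)
{
    assert(port != NULL);
    if (svc > UINT16_MAX)
        return -1;
    *port = (uint16_t)svc;
    return 0;
}
```

The range check is the one place where the transports differ: a value
above 0xffff can be a valid IB service but has no TCP port equivalent.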
> To head off the knee-jerk objection: this API does NOT require any
> transport-specific code in consumers (unless a particular consumer
> WANTS to look inside the RDMA transport address). Code to connect
> would be as simple as:
>
>     rdma_resolve_address(...);
>     /* wait for resolution */
>     ib_create_qp(...);   /* use device pointer we got from
>                             rdma_resolve_address() */
>     rdma_connect(...);   /* pass transport address we got from
>                             rdma_resolve_address() */
>     /* wait for connection to finish... */
>
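The ordering in that sequence (resolve first, create the QP on the
device that comes back, then connect) can be sketched in compilable
form as below. The stub_* functions are hypothetical stand-ins that
complete synchronously so the flow is easy to follow; they are not the
proposed rdma_resolve_address()/rdma_connect() or the real
ib_create_qp(), and a real resolver would complete asynchronously.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal hypothetical types, just enough to show the ordering. */
struct device { int id; };
struct qp { const struct device *dev; };
struct transport_addr { int opaque; };

struct conn_ctx {
    const struct device *dev;
    struct transport_addr addr;
    struct qp qp;
    int connected;
};

typedef void (*resolve_done_fn)(int status, const struct device *dev,
                                const struct transport_addr *addr,
                                void *context);

static const struct device stub_dev = { 1 };

/* Stand-in resolver: calls back immediately with a fixed device. */
static int stub_resolve_address(unsigned int dest_ip, resolve_done_fn done,
                                void *context)
{
    struct transport_addr addr = { (int)(dest_ip & 0xff) };

    assert(done != NULL);
    done(0, &stub_dev, &addr, context);
    return 0;
}

/* Stand-in QP creation: needs the device, hence must follow resolution. */
static int stub_create_qp(const struct device *dev, struct qp *qp)
{
    if (dev == NULL)
        return -1;
    qp->dev = dev;
    return 0;
}

/* Stand-in connect: takes the QP plus the resolved transport address. */
static int stub_connect(const struct qp *qp, const struct transport_addr *addr)
{
    return (qp->dev != NULL && addr != NULL) ? 0 : -1;
}

/* Resolution callback: only now is the device known, so create the QP here. */
static void on_resolved(int status, const struct device *dev,
                        const struct transport_addr *addr, void *context)
{
    struct conn_ctx *c = context;

    if (status != 0)
        return;
    c->dev = dev;
    c->addr = *addr;
    if (stub_create_qp(c->dev, &c->qp) != 0)
        return;
    if (stub_connect(&c->qp, &c->addr) == 0)
        c->connected = 1;
}

static int connect_to(unsigned int dest_ip, struct conn_ctx *c)
{
    c->dev = NULL;
    c->qp.dev = NULL;
    c->connected = 0;
    stub_resolve_address(dest_ip, on_resolved, c);
    return c->connected ? 0 : -1;
}
```

Nothing in on_resolved() or connect_to() inspects the transport
address, so the consumer code stays transport-neutral.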
> The listen side is even simpler:
>
> rdma_listen():
> inputs: local service, event callback, consumer context
>
> Wait for connection requests and pass events to the consumer's
> callback. I'm not sure if/how we want to support binding to
> a particular IP address. The current IB CM in Linux doesn't
> support binding a listen to a single device or port, and even
> if it did it's not clear how to handle binding to one IP
> address when a port has more than one IP.
>
> I guess the event callback would receive a device pointer and
> the same RDMA transport address union I talked about above
> when discussing address resolution.
>
> It would be possible to have another function like
> rdma_getpeername() that takes the transport address and
> returns a source IP address. In the IB case this would do an
> ATS reverse lookup. However, I hate this idea. iSER already
> uses the CM private data to pass the source IP in the IB case,
> and I would much rather fix NFS/RDMA to do the same thing (so
> we can just kill ATS as an address resolution method).
>
I think we should allow a ULP to listen on a specific IP address and
device; servers often do this to limit the scope of a service. I was
thinking such a ULP would simply walk the list of IB devices and issue
an rdma_listen() on each device. But the ULP should also be able to
pass down the local IP address on which the device should listen.
Consider an NFS/RDMA server ULP whose exports are limited to a single
subnet or interface, etc.