[openib-general] RDMA connection and address translation API
Roland Dreier
rolandd at cisco.com
Tue Aug 23 22:07:07 PDT 2005
At the OpenIB workshop on Monday, we had some discussion about a
high-level transport-neutral API for connection handling. After
giving the topic some more thought, I've come to the conclusion that
neither the kDAPL API nor the new API that was presented is usable.
In this email, I'll try to detail my reasoning and sketch what I
believe is the correct API.
The new API that we looked at was essentially the following (I'm
recreating this from memory, so I apologize if I misrepresent it):
    listen(local_ip_address, service_id, listen_callback)

    connect(local_qp, remote_ip_address, qos, service_id,
            private_data, connect_callback)
We already discussed the problem with having the listen callback pass
the consumer a remote source address -- doing this requires the
connection handling module to do an ATS reverse lookup in the IB case,
which the consumer might not want. I think there's agreement that the
correct thing here is for the listen callback to pass a transport
address to the consumer and provide a function that the consumer can
call to perform an ATS reverse lookup if desired. This isn't a major
problem and can be dealt with.
However, there's another problem with trying to lump address
translation and connection into a single "connect" call, and this
problem looks fundamental and fatal to me. The connect call takes a
QP pointer, but to create a QP the consumer needs to know which local
device to use. However, the consumer doesn't know which device to use
until the destination address has been resolved to a route, including
a local interface.
As far as I can tell, kDAPL punts on this and simply requires the
consumer to handle the route lookup itself before calling
dat_ep_connect(). It seems that current kDAPL consumers similarly
punt on this issue: the iSER initiator and the NFS-RDMA client both
just use a single device which is statically discovered at init time.
It seems that the kDAPL connection model has a serious flaw, in that
it pushes the complexity of route lookup into the consumer. Further,
we have strong evidence that this routing code is hard to write and
that consumers will just ignore this complexity and hard-code
solutions that don't work under all configurations.
With this in mind, I believe that the connection API needs to be
something more like the following:
rdma_resolve_address():
    inputs: dest IP address, qos, npaths,
            done callback, opaque context
    done callback params: status, local RDMA device,
            RDMA transport address, context
This function starts the process of resolving an IP address to
an RDMA device and address. When the resolution is complete,
the callback is called with a status. If the status is
"success" then the callback also gets the device pointer and
transport address (as well as the original context that the
consumer passed in).
The "RDMA transport address" type is a union containing
transport-dependent data. In the IB case, it's all of the
SGID, DGID, SLID, DLID, SL etc. that we know and love. In the
iWARP case, it's the source IP, destination IP and QOS.
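As a rough illustration of the transport address type described above, here is a minimal user-space sketch in C. Every name in it (rdma_transport_addr, rdma_ib_addr, rdma_iwarp_addr, rdma_addr_copy) is hypothetical, the field widths are assumptions, the iWARP member is IPv4-only for brevity, and the discriminating "transport" tag wrapped around the union is my own addition rather than something the proposal specifies.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical names throughout; nothing here is an existing kernel type. */
enum rdma_transport { RDMA_TRANSPORT_IB, RDMA_TRANSPORT_IWARP };

/* IB case: the usual path fields (field widths are assumptions). */
struct rdma_ib_addr {
    uint8_t  sgid[16];
    uint8_t  dgid[16];
    uint16_t slid;
    uint16_t dlid;
    uint8_t  sl;
};

/* iWARP case: source IP, destination IP and QOS (IPv4 only, for brevity). */
struct rdma_iwarp_addr {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint8_t  qos;
};

/* Transport-dependent data in a union, plus an assumed discriminating tag. */
struct rdma_transport_addr {
    enum rdma_transport transport;
    union {
        struct rdma_ib_addr    ib;
        struct rdma_iwarp_addr iwarp;
    } u;
};

/* A consumer can store or forward the address without looking inside it. */
static void rdma_addr_copy(struct rdma_transport_addr *dst,
                           const struct rdma_transport_addr *src)
{
    memcpy(dst, src, sizeof(*dst));
}
```

The point of the sketch is that a transport-neutral consumer only ever copies or passes the whole structure; only a consumer that WANTS transport details needs to look at the union members.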
npaths can be either 1 or 2 in the IB case; if it's 2, then
the resolver will try to find a primary and alternate path for
APM. In the iWARP case, I guess npaths will always be 1, and
I guess anyone who wants to use iWARP over multihomed SCTP
will probably have to use some lower-level API.
By the way, we may also have to have the option of passing in
a local netdev so that we can handle link-local IPv6
addresses. There may be other cases I haven't thought of yet.
I just hope we can avoid going all the way to the horror of
the getaddrinfo() API.
I also hope we can agree to use IPoIB ARP to resolve the
address in the IB case; having a flag or some other hack in
the API to expose the option of ATS seems unacceptably ugly.
rdma_connect():
    inputs: local QP, RDMA transport address, destination service,
            private data, timeout, event callback, opaque context
This function takes the resolved address and actually connects.
I'm not sure how we want to abstract the IB service vs. iWARP
TCP port number difference. I guess it's OK to have iWARP
consumers stick their (16-bit) port number in a 64-bit
parameter, even if it's not the prettiest API.
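To make the "16-bit port in a 64-bit parameter" idea concrete, here is a small C sketch. The helper names rdma_service_from_port() and rdma_service_to_port() are hypothetical, and the choice to carry the port in the low 16 bits with all upper bits zero is an assumption, not something the proposal settles.

```c
#include <stdint.h>

/* Hypothetical helper: widen an iWARP TCP port into the 64-bit
 * "destination service" parameter (assumed: port in the low 16 bits). */
static inline uint64_t rdma_service_from_port(uint16_t port)
{
    return (uint64_t)port;
}

/* Hypothetical helper: recover the port on the iWARP side.  Returns -1 if
 * the value does not fit in 16 bits, i.e. it cannot be a TCP port. */
static inline int rdma_service_to_port(uint64_t service, uint16_t *port)
{
    if (service > UINT16_MAX)
        return -1;
    *port = (uint16_t)service;
    return 0;
}
```

An IB consumer would pass a full 64-bit service ID through the same parameter unchanged; only the iWARP path would need the range check.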
To head off the knee-jerk objection: this API does NOT require any
transport-specific code in consumers (unless a particular consumer
WANTS to look inside the RDMA transport address). Code to connect
would be as simple as:
    rdma_resolve_address(...);
    /* wait for resolution */
    ib_create_qp(...);  /* use device pointer we got from rdma_resolve_address() */
    rdma_connect(...);  /* pass transport address we got from rdma_resolve_address() */
    /* wait for connection to finish... */
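For anyone who wants to see the ordering spelled out, here is a self-contained user-space mock of that flow in C. It is only a sketch: all the types are hypothetical stand-ins, the stub resolver completes immediately instead of asynchronously, stub_create_qp() stands in for ib_create_qp(), and rdma_connect() is cut down to three parameters (private data, timeout, event callback and context are omitted).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real device, address and QP types. */
struct rdma_device { const char *name; };
struct rdma_transport_addr { int transport; };   /* opaque to the consumer */
struct rdma_qp { struct rdma_device *device; };

typedef void (*rdma_resolve_cb)(int status, struct rdma_device *dev,
                                const struct rdma_transport_addr *addr,
                                void *context);

static struct rdma_device stub_dev = { "stub0" };
static struct rdma_transport_addr stub_addr = { 0 };

/* Stub resolver: "resolves" at once; a real one would call back later. */
static int rdma_resolve_address(uint32_t dest_ip, int qos, int npaths,
                                rdma_resolve_cb done, void *context)
{
    (void)dest_ip; (void)qos;
    if (npaths < 1 || npaths > 2)
        return -1;
    done(0, &stub_dev, &stub_addr, context);
    return 0;
}

/* Stands in for ib_create_qp(): the QP is tied to the resolved device. */
static int stub_create_qp(struct rdma_device *dev, struct rdma_qp *qp)
{
    if (!dev)
        return -1;
    qp->device = dev;
    return 0;
}

/* Cut-down connect: succeeds if the QP has a device and an address is given. */
static int rdma_connect(struct rdma_qp *qp,
                        const struct rdma_transport_addr *addr,
                        uint64_t service)
{
    (void)service;
    return (qp->device && addr) ? 0 : -1;
}

struct consumer { struct rdma_qp qp; int connected; };

/* Consumer callback: no transport-specific code anywhere in here. */
static void resolve_done(int status, struct rdma_device *dev,
                         const struct rdma_transport_addr *addr, void *context)
{
    struct consumer *c = context;

    if (status != 0)
        return;
    if (stub_create_qp(dev, &c->qp) != 0)   /* device came from resolution */
        return;
    if (rdma_connect(&c->qp, addr, 3260) == 0)
        c->connected = 1;
}
```

The consumer never needs to know a device before resolution: the QP is created inside the callback, on whatever device the resolver hands back.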
The listen side is even simpler:
rdma_listen():
    inputs: local service, event callback, consumer context
Wait for connection requests and pass events to the consumer's
callback. I'm not sure if/how we want to support binding to
a particular IP address. The current IB CM in Linux doesn't
support binding a listen to a single device or port, and even
if it did it's not clear how to handle binding to one IP
address when a port has more than one IP.
I guess the event callback would receive a device pointer and
the same RDMA transport address union I talked about above
when discussing address resolution.
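A possible shape for the listen side, again as a user-space C mock with hypothetical names. The rdma_listener handle passed to rdma_listen() is my own assumption (the inputs listed above are just service, callback and context), and stub_deliver_request() only simulates what a connection handling module would do when a request arrives.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real device and address types. */
struct rdma_device { const char *name; };
struct rdma_transport_addr { int transport; };

/* Event callback: gets the device and the transport address of the request. */
typedef void (*rdma_listen_cb)(struct rdma_device *dev,
                               const struct rdma_transport_addr *addr,
                               void *context);

/* Assumed listener handle holding what the consumer registered. */
struct rdma_listener {
    uint64_t       service;
    rdma_listen_cb cb;
    void          *context;
};

static int rdma_listen(struct rdma_listener *l, uint64_t service,
                       rdma_listen_cb cb, void *context)
{
    if (!l || !cb)
        return -1;
    l->service = service;
    l->cb = cb;
    l->context = context;
    return 0;
}

/* Stub: simulate an incoming connection request for a given service. */
static int stub_deliver_request(struct rdma_listener *l, uint64_t service,
                                struct rdma_device *dev,
                                const struct rdma_transport_addr *addr)
{
    if (l->service != service)
        return -1;
    l->cb(dev, addr, l->context);
    return 0;
}

/* Example consumer callback: just counts the requests it sees. */
static void count_requests(struct rdma_device *dev,
                           const struct rdma_transport_addr *addr,
                           void *context)
{
    (void)dev; (void)addr;
    ++*(int *)context;
}
```

Note that the callback is handed a transport address, not a source IP, so no reverse lookup is needed to deliver the event.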
It would be possible to have another function like
rdma_getpeername() that takes the transport address and
returns a source IP address. In the IB case this would do an
ATS reverse lookup. However, I hate this idea. iSER already
uses the CM private data to pass the source IP in the IB case,
and I would much rather fix NFS/RDMA to do the same thing (so
we can just kill ATS as an address resolution method).
- R.