[openib-general] RE: Connection and Address Translation

Caitlin Bestler caitlinb at broadcom.com
Wed Aug 24 10:39:17 PDT 2005


I have several comments on this topic.

First, I strongly endorse the policy decision made long
ago in the DAT Collaborative that a network address is 
a flat numeric identifier with IPv6 semantics. I still
believe that all interfaces designed for application
developers should follow that form.

Now admittedly the interface used between the DAT Provider (and
other middleware and maybe a handful of highly sophisticated
kernel applications) is a different question. Such an interface
could distinguish between an Address with IPv6 semantics and
a lower layer "RDMA Address", but such a distinction would 
be of minimal benefit to the DAT Provider. Therefore I think
it needs to be justified somewhere. The only benefit I see
in the context of kDAPL is that it moves some logic from
the device-dependent verbs to core code.

One thing that we have to be careful about in defining this
additional API is that the nature of iWARP paths is not
artificially frozen. In particular Roland's proposal comes
close to assuming that iWARP is a single-path-only subset
of InfiniBand. That is incorrect. It is more correct to say
that path selection, including path migration, takes place at
L3 or L4 and is invisible to L5.

This includes both multi-homed IP transports (such as SCTP)
and the IP layer itself (where an IP address can be migrated
to a new Ethernet port). I will be presenting a paper at the
RAIT conference dealing with a multi-homed option for MPA/TCP.
So there are several path failover options for iWARP. The 
difference from InfiniBand is not that it has only a single
path, but that path selection and failover are transparent to
the RDMA layer. It is also transparent to the kDAPL consumer,
so exposing path failover to the DAT Provider in a way that
interferes with below-RDMA path failure in iWARP would be
a mistake.

In particular, if we pursue the "RDMA Address" proposal it
should be clear that the resolved "source address" can still
be a "Don't Care".

When iWARP Connection Management is implemented over TCP the
effect of specifying a local address is to do a bind on the
socket before calling connect. For the vast majority of topologies
the destination address alone is sufficient to ensure routing 
through the correct RDMA device (and it can avoid any premature
selection of a specific Ethernet port).
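A minimal sketch of the bind-before-connect behaviour described above,
using ordinary TCP sockets rather than any RDMA API. The function name
connect_with_optional_source and its parameters are hypothetical; the
point is only that the local address is optional and that leaving it
out lets the routing table choose.

```python
import socket
from typing import Optional, Tuple


def connect_with_optional_source(dest: Tuple[str, int],
                                 source_ip: Optional[str] = None,
                                 timeout: float = 5.0) -> socket.socket:
    """Connect to dest, binding first only if a local address was given.

    Hypothetical helper for illustration; not part of kDAPL or the verbs.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        if source_ip is not None:
            # Port 0 asks the kernel for an ephemeral port on that address.
            sock.bind((source_ip, 0))
        # Without a bind, the local address is "Don't Care": the kernel
        # picks it from the route to the destination.
        sock.connect(dest)
    except OSError:
        sock.close()
        raise
    return sock
```

Calling it with source_ip=None corresponds to the "Don't Care" source
address; passing an address pins the connection to that interface
before the route lookup happens.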

The problem with listen is more complex. The fully correct interface
from the application perspective would allow the application to 
listen on a *set* of local addresses. But an efficient transport
neutral definition of a "set of local addresses" is not an easy
thing to come up with. The decision within the DAT Collaborative
was to punt on this issue. The Consumer was required to issue a
single listen for the Service ID/Port for an RDMA device, and then
to figure out what to do with the Connection Request based on both
the local address requested and potentially the remote address. In the
IP world it is very common for content servers (HTTP, FTP, ...) to
present different content based upon which of the server's IP addresses
was requested.

It is also very easy to listen on a *single* IA Address in a transport
neutral fashion.

So barring any great inspirations about how to represent a set of
addresses, I would suggest that we stick with the "single Address or
all addresses supported by the device" approach. It certainly does not
make sense for IB to emulate IP subnet masking.

That means there are three services needed. These are in fact
identified by DAT, but they are not specified. The Consumer was
directed to use the existing OS-specific solutions to perform these
functions:

1) Select IP interface given desired destination address and Class
   of Service.
2) Select IP interface given desired local address.
3) Select DAT Device matching IP interface.

DAT, being OS neutral, did not specify these functions. Working within
Linux they can be specified. The intent is for the first two to match
the same APIs/procedures used for sockets, firewalls and first-hop
routing.
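As a rough user-space illustration of the first service, the sketch
below asks the kernel which local address it would use to reach a
given destination, using the same route lookup that sockets use. It
relies on the fact that connect() on a UDP socket sends no packets.
The helper name local_address_for is hypothetical, Class of Service is
ignored, and mapping the resulting address to a DAT device (the third
service) is not shown.

```python
import socket


def local_address_for(dest_ip: str, port: int = 9) -> str:
    """Return the local IP the kernel would select to reach dest_ip.

    Hypothetical helper: connect() on a datagram socket only performs
    the route lookup and fixes the source address; nothing is sent.
    """
    family = socket.AF_INET6 if ":" in dest_ip else socket.AF_INET
    with socket.socket(family, socket.SOCK_DGRAM) as probe:
        probe.connect((dest_ip, port))
        return probe.getsockname()[0]
```

For example, local_address_for("127.0.0.1") resolves to the loopback
address, since that is the interface the route to the destination
uses.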


Ultimately this means that on the listen side the verbs consumer must
be able to:

a) Listen on either a specific address/port or on "all
   addresses/specific port" for the device.
b) Receive the actual local address used in the Connection Request.
c) Query the remote address given a Connection Request, but this does
   not need to be delivered by default.
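The three listen-side requirements above have a direct analogue in
plain TCP sockets, sketched below as an assumption-laden illustration
rather than a verbs interface. The function names open_listener,
accept_with_local_address and remote_address are hypothetical.

```python
import socket
from typing import Tuple


def open_listener(port: int, local_ip: str = "") -> socket.socket:
    """(a) Listen on one address, or on all addresses if local_ip is ""."""
    lsock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    lsock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    lsock.bind((local_ip, port))  # "" means the wildcard address
    lsock.listen(16)
    return lsock


def accept_with_local_address(
        lsock: socket.socket) -> Tuple[socket.socket, Tuple[str, int]]:
    """(b) Accept a request and report the local address it arrived on."""
    conn, _peer = lsock.accept()
    return conn, conn.getsockname()


def remote_address(conn: socket.socket) -> Tuple[str, int]:
    """(c) Query the remote address only when the consumer asks for it."""
    return conn.getpeername()
```

A server listening on the wildcard address can then branch on the
local address returned by accept_with_local_address, which is how a
content server presents different content per IP address.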

I will also point out that there is nothing in the DAT interface that
requires an extra wire step for InfiniBand. The definition of IA
Address is specifically designed to support division by subnetting,
and this is even assumed when multiple DAT Providers are supported
that use different transports.

The System/Network administrators are *already* required to ensure
that the 128-bit IPv6-like address space is divided unambiguously
between the different RDMA devices. Further, they are required to
guarantee that the IA Address mapping does not contradict the mapping
of IP addresses.
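One way to picture the "divided unambiguously" requirement is a table
of IPv6-style prefixes, one or more per RDMA device, with no two
prefixes overlapping. The sketch below checks that property and looks
up the device for an address. The prefixes (taken from the IPv6
documentation range) and the device names are made up for
illustration, as are the function names.

```python
import ipaddress
from itertools import combinations
from typing import Dict, Optional


def check_unambiguous(prefix_to_device: Dict[str, str]) -> None:
    """Raise ValueError if any two prefixes in the table overlap."""
    nets = [ipaddress.IPv6Network(p) for p in prefix_to_device]
    for a, b in combinations(nets, 2):
        if a.overlaps(b):
            raise ValueError(f"ambiguous IA Address space: {a} and {b}")


def device_for(address: str,
               prefix_to_device: Dict[str, str]) -> Optional[str]:
    """Return the device whose prefix contains address, if any."""
    addr = ipaddress.IPv6Address(address)
    for prefix, device in prefix_to_device.items():
        if addr in ipaddress.IPv6Network(prefix):
            return device
    return None


# Hypothetical administrator-assigned table.
EXAMPLE_TABLE = {
    "2001:db8:1::/64": "ib-hca-0",
    "2001:db8:2::/64": "iwarp-rnic-0",
}
```

With a table that passes check_unambiguous, every IA Address resolves
to at most one device, so the choice of transport follows from the
address alone.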

That means that the System/Network administrators can easily define
one or more IPv6 network IDs that translate directly to assigned GIDs.
If there is only a single InfiniBand subnet it can even be the
link-local prefix. With such a solution there is no run-time penalty
for supplying both local and remote "IA Addresses" in a connection
request.
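Because a GID and an IPv6 address are both 128 bits, the direct
translation described above can be a plain reinterpretation of the
same 16 bytes. The sketch below assumes the usual GID layout of a
64-bit subnet prefix followed by a 64-bit GUID, with fe80::/64 as the
link-local prefix. The function names and the GUID in the usage note
are hypothetical.

```python
import ipaddress

# Link-local prefix fe80::/64 as the upper 64 bits of the address.
LINK_LOCAL_PREFIX = 0xFE80000000000000


def gid_from_parts(subnet_prefix: int, port_guid: int) -> bytes:
    """Build a 16-byte GID from a 64-bit prefix and a 64-bit GUID."""
    if not (0 <= subnet_prefix < 1 << 64 and 0 <= port_guid < 1 << 64):
        raise ValueError("prefix and GUID must each fit in 64 bits")
    return ((subnet_prefix << 64) | port_guid).to_bytes(16, "big")


def ia_address_to_gid(ia_address: str) -> bytes:
    """Treat an IPv6-form IA Address as a GID: no lookup, same bytes."""
    return ipaddress.IPv6Address(ia_address).packed


def gid_to_ia_address(gid: bytes) -> str:
    """Render a 16-byte GID in IPv6 textual form."""
    return str(ipaddress.IPv6Address(gid))
```

For instance, gid_from_parts(LINK_LOCAL_PREFIX, 0x0002C90300001234)
renders as the address fe80::2:c903:0:1234, and converting that
address back yields the same 16 bytes, which is why no run-time
resolution step is needed.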


As long as the "unspecified" local address remains a valid option, the
proposed API split is definitely implementable under iWARP. But it is
obviously simpler to just keep the same semantics as the exported
kDAPL interface. Is there a definite benefit to making the change? If
so, what is it, and for which middleware/application?




