[openib-general] IB Address Translation service

Tue Mar 1 02:29:46 PST 2005

Eric, let me correct some of your assumptions, which are exactly what
this API is meant to protect against; see below.

> -----Original Message-----
> From: Eric W. Biederman [mailto:eric at lnxi.com] On Behalf Of Eric W.
> Biederman
> Sent: Tuesday, March 01, 2005 9:18 AM
> To: Yaron Haviv
> Cc: Roland Dreier; shaharf; openib-general at openib.org
> Subject: Re: [openib-general] IB Address Translation service
> 
> "Yaron Haviv" <yaronh at voltaire.com> writes:
> 
> > > -----Original Message-----
> > > From: openib-general-bounces at openib.org [mailto:openib-general-
> > > bounces at openib.org] On Behalf Of Roland Dreier
> > > Sent: Monday, February 28, 2005 7:13 PM
> > > To: shaharf
> > > Cc: openib-general at openib.org
> > > Subject: Re: [openib-general] IB Address Translation service
> > >
> > > This API seems overly complex and at the same time too inflexible to
> > > me.  However, rather than getting bogged down nitpicking about APIs, I
> > > think we have to take a few steps back.
> >
> > I believe the API is very flexible, but we are pretty open to hear what
> > you think is needed in addition
> >
> > > First, let's understand the problem we're trying to solve.  Who are
> > > the consumers of this address translation service?
> >
> > The first problem is that most ULPs use valid IP addresses for
> > simplicity (DAPL, iSER, NFS/RDMA, SDP, MPI, etc') and someone needs to
> > resolve it to an IB address and device to use IB. This should take into
> > account cases where there are more than one HCAs in the system.
> > Preferable/optionally the ULP would like to know which partition to use
> > if there is more than one, and leverage on the IP subnetting done by
> > IPoIB.
> 
> I am confused.  In any sane network the translation is:
> Hostname -> address.
> 
> IP because it spans multiple networks does:
> Hostname -> IP address -> hw address.
> 
> IB because it can span multiple IB networks does:
> GUID+QPN -> LID + QPN.
> 
> So what is wrong with simply doing:
> Hostname -> GUID
> ???

1. In standard protocols such as SDP, iSER, NFS/RDMA, Oracle, etc.
(unlike OSU MPICH), the name service is one of the standard IP name
services mapping host names to IP addresses, and the ULP accepts a
destination IP and NOT a host name.

2. The InfiniBand hardware address is a GID, not a LID. The LID is a
path attribute, introduced to avoid the slow 48-bit lookup done in
Ethernet and to enable multi-pathing. A LID is dynamically allocated,
and you may also have multiple LIDs per port.
(The OSU MPICH implementation is a bad example of IB citizenship.)
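
To illustrate the "multiple LIDs per port" point, here is a minimal
sketch of how LID Mask Control (LMC) gives a port 2^LMC consecutive
LIDs starting at its base LID. The function names are made up for this
example and are not part of any proposed API.

```c
#include <assert.h>
#include <stdint.h>

/* Number of LIDs a port answers to for a given LMC value (3-bit field). */
unsigned lids_per_port(unsigned lmc)
{
    assert(lmc <= 7);
    return 1u << lmc;
}

/* n-th LID of a port; the base LID must have its low 'lmc' bits clear. */
uint16_t port_lid(uint16_t base_lid, unsigned lmc, unsigned n)
{
    assert((base_lid & (lids_per_port(lmc) - 1)) == 0);
    assert(n < lids_per_port(lmc));
    return (uint16_t)(base_lid | n);
}
```

With LMC = 2 and base LID 0x10, the port answers to LIDs 0x10 through
0x13, while its GID stays the same.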

So to summarize:

Ethernet:   Host Name -> IP -> MAC Address 
InfiniBand: Host Name -> IP -> GID Address -> Path (LID, SL, ..)

So if we intend to rely on standard name services, we can start with IP
(or implement a proprietary name service for Name->HW Addr if we wish).

Then we need to translate an IP to a HW address (GID/GUID) and the
equivalent of VLANs (partitions). This is provided by the
ib_at_route_by_ip call.
Internally it is based on IP and IPoIB mechanisms, similar to how
Libor implemented it in SDP (and optionally, if we see a need, using
ATS).

Then in IB we need to resolve a GID to path attributes, which consist of
LID, SL/VL, MTU, etc.
The inputs are the source, destination, partition and QoS attributes,
and the result is a path. Since IB also supports multi-pathing, a user
may receive multiple paths that can be used for high availability,
performance aggregation, or source-based routing.
A path may also travel through isolated congestion domains using VLs.

The ib_at_paths_by_route call resolves a HW address + preferences to
one or more path records that are then used by the ULP & CM.
It can also be used by non-IP-based ULPs such as SRP or MPICH. That is
why the API, unlike the current SDP implementation, is divided into two
calls: one for the HW address and one for the path.
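
As a rough illustration of the two-stage split, here is a toy sketch in
plain C. Everything in it is hypothetical: the struct layouts, the
function names route_by_ip/paths_by_route, the static tables and the
return conventions are stand-ins I made up, not the signatures of
ib_at_route_by_ip or ib_at_paths_by_route.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical types; not the proposed ib_at structures. */
struct hw_route { uint8_t dgid[16]; uint16_t pkey; };              /* stage 1 result */
struct path_rec { uint16_t dlid; uint8_t sl; uint16_t mtu_bytes; }; /* stage 2 result */

#define PEER_GID { 0xfe, 0x80, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1 }

/* Toy stand-in for the IP/IPoIB neighbour lookup (IP -> GID + partition). */
static const struct { uint32_t ip; struct hw_route route; } neigh_table[] = {
    { 0x0a000001u, { PEER_GID, 0xffff } },
};

/* Toy stand-in for an SA path query: two paths to the same GID. */
static const struct { uint8_t dgid[16]; struct path_rec path; } sa_table[] = {
    { PEER_GID, { 0x10, 0, 2048 } },
    { PEER_GID, { 0x11, 0, 2048 } },
};

/* Stage 1: IP address -> HW address and partition. Returns 0 or -1. */
int route_by_ip(uint32_t ip, struct hw_route *out)
{
    assert(out != NULL);
    for (size_t i = 0; i < sizeof(neigh_table) / sizeof(neigh_table[0]); i++) {
        if (neigh_table[i].ip == ip) {
            *out = neigh_table[i].route;
            return 0;
        }
    }
    return -1;
}

/* Stage 2: HW address -> up to 'max' path records. Returns the count. */
int paths_by_route(const struct hw_route *route, struct path_rec *out, int max)
{
    int n = 0;
    assert(route != NULL && out != NULL);
    for (size_t i = 0; i < sizeof(sa_table) / sizeof(sa_table[0]) && n < max; i++) {
        if (memcmp(sa_table[i].dgid, route->dgid, 16) == 0)
            out[n++] = sa_table[i].path;
    }
    return n;
}
```

An IP-based ULP would call both stages in order; a ULP that already
knows the GID (e.g. SRP) would skip stage 1 and call only the second.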

Currently OSU MPICH uses a proprietary name service and LID+QP
assignment. It doesn't work the standard IB way with SA & CM, so it
doesn't make use of a lot of IB capabilities, and it is also more static
and less robust. I wouldn't use it as the example for a ULP
implementation: the MPI layer, which has no idea about fabric
routing/utilization/availability, is determining the path.
Another simple scenario: your application requires running MPI and NFS
on different IB VLs. Today you need to manually configure (recompile)
that in each ULP; with this proposal it can be done automatically with a
central configuration on the SM.
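
To make the central-configuration idea concrete, here is a small
sketch of a policy table that maps a ULP to a service level (which the
fabric then maps to a VL). The table contents, the struct and the
function name are invented for illustration; in the proposal this
information would come back with the path rather than from a local
table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical policy entry: which SL a given ULP should use. */
struct sl_policy { const char *service; uint8_t sl; };

/* Example values only; a real fabric would define these centrally. */
static const struct sl_policy policy[] = {
    { "mpi", 1 },
    { "nfs", 2 },
};

/* Look up the SL for a ULP; unknown ULPs fall back to SL 0. */
uint8_t sl_for_service(const char *service)
{
    assert(service != NULL);
    for (size_t i = 0; i < sizeof(policy) / sizeof(policy[0]); i++) {
        if (strcmp(policy[i].service, service) == 0)
            return policy[i].sl;
    }
    return 0;
}
```

The point is that changing the MPI/NFS separation means editing one
table, not recompiling each ULP.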

SDP, on the other hand, uses the same mechanisms; however, we cannot use
its implementation for other ULPs (e.g. kDAPL), and it is also missing
functionality that is needed by many of our users.
The proposal calls for one set of calls for current and future ULPs.

> It would be brain damaged for DAPL to require IP addresses.  Not that
> DAPL hasn't shown some brain damage already.

DAPL uses IP addresses since it is a common API for IB & Ethernet/RDMA.
I'm not sure what is wrong with IP; millions use it and are familiar
with it, which is something I can't say about GIDs & LIDs.

> You can't do GUID -> IP because there is not a requirement on
> a 1 to 1 mapping.  And in general there is no fixed IP -> GUID mapping.

If you dig into the call, you will see that it returns an array of IPs;
you can also specify a VLAN (P_Key).

> 
> What are the semantics in the upper levels when the IP -> GUID mapping
> changes?  Does you connection properly follow the IP to the new GUID?
> 

That's a ULP implementation question; I believe in general it shouldn't.

> Just FYI IPv6 doesn't use arp.

The implementation will depend on the IP stack to provide the IP->GID
mapping, so it supports both IPv4 & IPv6.

Yaron