[openib-general] IB Address Translation service

Mon Feb 28 23:17:35 PST 2005

"Yaron Haviv" <yaronh at voltaire.com> writes:

> > -----Original Message-----
> > From: openib-general-bounces at openib.org [mailto:openib-general-
> > bounces at openib.org] On Behalf Of Roland Dreier
> > Sent: Monday, February 28, 2005 7:13 PM
> > To: shaharf
> > Cc: openib-general at openib.org
> > Subject: Re: [openib-general] IB Address Translation service
> > 
> > This API seems overly complex and at the same time too inflexible to
> > me.  However, rather than getting bogged down nitpicking about APIs, I
> > think we have to take a few steps back.
> 
> I believe the API is very flexible, but we are pretty open to hear what
> you think is needed in addition.
> 
> > First, let's understand the problem we're trying to solve.  Who are
> > the consumers of this address translation service?
> 
> The first problem is that most ULPs use valid IP addresses for
> simplicity (DAPL, iSER, NFS/RDMA, SDP, MPI, etc.) and someone needs to
> resolve it to an IB address and device to use IB. This should take into
> account cases where there is more than one HCA in the system.
> Preferably/optionally the ULP would like to know which partition to use
> if there is more than one, and leverage the IP subnetting done by
> IPoIB.

I am confused.  In any sane network the translation is:
Hostname -> address.

IP, because it spans multiple networks, does:
Hostname -> IP address -> hw address.

IB, because it can span multiple IB networks, does:
GUID+QPN -> LID + QPN.

So what is wrong with simply doing:
Hostname -> GUID
???

Then all the kernel needs to be passed is GUID + QPN.
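
A minimal user-space sketch of that Hostname -> GUID step, assuming a
static lookup table. The names (`HOST_TO_GUID`, `IBEndpoint`, `resolve`),
the host names, and the GUID values are all invented for illustration;
nothing here is an existing OpenIB interface.

```python
from typing import Dict, NamedTuple


class IBEndpoint(NamedTuple):
    """What would be handed down to the kernel: a GUID plus a QPN."""
    guid: int  # 64-bit port GUID
    qpn: int   # 24-bit queue pair number


# Hypothetical table; in practice this could come from a file or a
# directory service, entirely outside the kernel.
HOST_TO_GUID: Dict[str, int] = {
    "node1": 0x0002C90200001111,
    "node2": 0x0002C90200002222,
}


def resolve(hostname: str, qpn: int) -> IBEndpoint:
    """Map a hostname directly to a GUID, with no IP address involved."""
    if not 0 <= qpn < (1 << 24):
        raise ValueError("QPN must fit in 24 bits")
    try:
        guid = HOST_TO_GUID[hostname]
    except KeyError:
        raise LookupError(f"no GUID known for host {hostname!r}") from None
    return IBEndpoint(guid, qpn)


if __name__ == "__main__":
    ep = resolve("node1", 0x12)
    print(f"GUID {ep.guid:#018x} QPN {ep.qpn:#08x}")
```

The GUID+QPN -> LID+QPN step is left out on purpose; it is a separate
lookup.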

I am certain MPI does not care about IP addresses.  It is the job
of the MPI launcher to resolve where all of the pieces are.  Generally
mpirun is done over IP and it just needs to collect the native network
addresses before it leaves.

It would be brain damaged for DAPL to require IP addresses.  Not that
DAPL hasn't shown some brain damage already.

Please, please remember that IP addresses 

> It is possible to replicate the same code you have in SDP (which is also
> not complete) across all ULPs, but I assume a better way is to provide it
> in one central place.

How about not even worrying about it?  It is an extra step that
introduces latency and confusion.

You can't do GUID -> IP because there is no requirement for
a 1 to 1 mapping.  And in general there is no fixed IP -> GUID mapping.

What are the semantics in the upper levels when the IP -> GUID mapping
changes?  Does your connection properly follow the IP to the new GUID?

I don't see this making sense anywhere except user space.

> There are also two proposed address resolution mechanisms, one is ARP
> used by SDP, and one is ATS used by some DAPL consumers, and we believe
> it is better to combine them under the same API.

Just FYI, IPv6 doesn't use ARP.

> The second problem relates to mapping of IB GID to one or more Path
> records.
> This is also something needed for ALL ULPs. Today each ULP provides the
> minimal subset of path resolution functionality without taking into
> account topics such as partitioning, QoS, source routing and
> multi-pathing.
> Some of these require using special SA queries (such as SA Multipath
> Record query and QoSPath Query).
> I don't think it makes sense to put all this functionality into each ULP
> as well.

That part is reasonable.  Although the fact that it is easy to knock
OpenSM down concerns me.  However, that looks to be a separate
problem.

> Then we can also discuss: does it make sense to have each path
> resolution call lead us to the SA, or does it make more sense to cache
> those paths?
> And if we cache, doesn't it make more sense to cache/invalidate the
> routes for all ULPs rather than implementing/having it in each ULP?
> Also not sure how a 1000 node cluster functions without the caching.
>  
> And the last problem is related to reverse resolution from IB to IP
> addresses, which is needed for DAPL, as well as for different management
> and diagnostic tools that want to know what node/port is really
> behind that GID address.
> 
> So how would you suggest we go about it?
> Duplicate all of that in each ULP?
> Refrain from implementing advanced routing, partitioning, QoS (we can't
> really maintain all that advanced code for each ULP)?

One small step at a time.  Where each step is obviously correct.

One giant leap only works well for internal use.  Not for things
that are heavily used.

> Our idea is to provide those few helper functions that enable people to
> make full use of IB and its features without reading all the IB spec,
> and a PhD.
> If you clear all the remarks from the library, you will see it is very
> slim, and to my understanding includes all the relevant input and
> output parameters for each of the 3 functions I mentioned.

But an interface like that is usually provided by glibc, not by the kernel.
And the mixing of levels in that proposed API is absolutely horrible.

Eric