[openib-general] Re: [PATCH] [ib_addr] generalize address to RDMA device translation

Tue Jan 3 14:01:30 PST 2006

openib-general-bounces at openib.org wrote:
> On Tue, 2006-01-03 at 12:05 -0800, Sean Hefty wrote:
>> Tom Tucker wrote:
>>> ARP Resolve
>>> 
>>> The iWARP side needs to be able to resolve an IP address to an
>>> Ethernet address. Today this is not done for iWARP and it works
>>> because the AMSO1100 does this itself in the hardware. Other iWARP
>>> devices probably don't. This means that the logic in ib_at needs to
>>> be extended on the iWARP side to call neigh_event_send (instead of
>>> arp_send) to resolve an IP to an Ethernet address.  The current
>>> method of calling arp_send directly and "sniffing" for arp replies
>>> is probably not the best way to go long term. It would be better to
>>> register for neighbor update events (new mechanism) and be
>>> notified when the neighbor entry gets resolved.
>>> This is better for two reasons: 1) it doesn't duplicate code already
>>> in Linux, and 2) unlike IB, Ethernet MAC addresses may change for
>>> the next hop while the connection is still active. The provider
>>> needs to know this so its hardware ARP tables can be updated.
>> 
>> To be clear, the CMA uses ib_addr, and not ib_at, which is a
>> different module. 
> 
> Absolutely. I was dumping a bunch of loosely related concerns...
> 
>> 
>> I'm not sure I understand what's wrong with sniffing arp replies.
>> There's very little code (about a dozen lines) in ib_addr to handle
>> arps.  It also seems that it's just as unlikely that the mapping from
>> an IP address to a hardware address will change for Ethernet as it
>> does for IB. 
> 
> Agreed -- It is unlikely. The more common case is a re-arp
> when the arp entry times out (typically 15 minutes).
> 
> 

It is unlikely, but it is also crucial. IP failover within a subnet
is dependent on arp updates.

Manually entering new ARP translations is even rarer (I think I've done
it about six times in nearly 20 years of working with IP networks),
but it is legal. And it is something that IP network administrators
can do now, and they do not expect RDMA to break it.
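
A minimal sketch of the event-driven approach discussed above, written
in Python for brevity rather than kernel C. NeighborTable, subscribe,
and RnicArpShadow are hypothetical names used only for illustration;
they are not existing kernel or OpenIB interfaces. The point is that
every change to the host's neighbor entry (re-ARP, failover, or a
manual entry) reaches the RNIC's copy of the table.

```python
from typing import Callable, Dict, List

# Hypothetical callback type: (ip, new_mac) -> None
NeighCallback = Callable[[str, str], None]


class NeighborTable:
    """Toy stand-in for the host's neighbor (ARP) cache."""

    def __init__(self) -> None:
        self._entries: Dict[str, str] = {}
        self._subscribers: List[NeighCallback] = []

    def subscribe(self, callback: NeighCallback) -> None:
        """Register a callback to run whenever a mapping changes."""
        self._subscribers.append(callback)

    def update(self, ip: str, mac: str) -> None:
        """Record a new or changed mapping and notify subscribers."""
        if self._entries.get(ip) == mac:
            return  # mapping unchanged, so no event is raised
        self._entries[ip] = mac
        for callback in self._subscribers:
            callback(ip, mac)


class RnicArpShadow:
    """Toy stand-in for an RNIC's hardware ARP table."""

    def __init__(self) -> None:
        self.hw_table: Dict[str, str] = {}

    def on_neigh_update(self, ip: str, mac: str) -> None:
        # Mirror the host's view so both IP layers use the same data.
        self.hw_table[ip] = mac


host = NeighborTable()
rnic = RnicArpShadow()
host.subscribe(rnic.on_neigh_update)

host.update("192.0.2.10", "00:11:22:33:44:55")  # initial resolution
host.update("192.0.2.10", "66:77:88:99:aa:bb")  # failover: same IP, new MAC
```

After the second update the RNIC's table holds the new MAC address
without the provider having to watch ARP traffic itself.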

>> Are you trying to deal with a destination IP address of a connection
>> that is not on the local subnet?  If this is the case, then this
>> seems like a separate issue than address resolution.
> 
> Yes, and no. The IP address being resolved is the peer if it
> is on the same subnet. If it is not, then the IP address
> being resolved is for the next hop.
> 
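The rule Tom describes can be sketched as follows. This is an
illustration only: address_to_resolve is a hypothetical helper, and it
assumes a single local subnet and a single gateway, whereas a real host
would consult its routing table.

```python
import ipaddress


def address_to_resolve(dest_ip: str, local_net: str, gateway_ip: str):
    """Return the IP address whose Ethernet address must be resolved.

    If the destination is on the local subnet, that is the peer itself;
    otherwise it is the next hop (here, a single assumed gateway).
    """
    dest = ipaddress.ip_address(dest_ip)
    net = ipaddress.ip_network(local_net)
    if dest in net:
        return dest
    return ipaddress.ip_address(gateway_ip)


# Peer on the same subnet: resolve the peer directly.
on_link = address_to_resolve("192.0.2.10", "192.0.2.0/24", "192.0.2.1")

# Peer on another subnet: resolve the next hop instead.
off_link = address_to_resolve("198.51.100.7", "192.0.2.0/24", "192.0.2.1")
```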
>> 
>>> ROUTE Changes
>>> 
>>> Two obvious cases, 1) the next hop changes due to normal network
>>> least-cost routing, and 2) the user changes a route manually. Both
>>> events would require the iWARP provider to be notified (via an event
>>> again) and update its hardware
>> 
>> Maybe this can be included as part of some sort of automatic
>> "failover"? Otherwise, I'm not sure how this functionality maps to
>> IB. It's not a big deal if it doesn't, but it'd be nice to keep
>> similarities where possible. 

The key point is that the IP layer implemented by the RNIC has
to be working from the same data as the IP layer implemented by
Linux. Since Linux does not implement the IB transport layer 
the same issue is not likely to come up.

In an IP network, changing routes is supposed to be transparent
to established connections, especially if there is no PMTU decrease.
So trying to map it to IB APM won't be a fit.
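
As a rough sketch of what "transparent" means for an offloaded
connection, the following Python models a route-change event. The
names OffloadedConnection and apply_route_change are hypothetical, and
the choice to never raise the path MTU here is a simplification of this
sketch (an increase would be left to normal path MTU discovery).

```python
from dataclasses import dataclass


@dataclass
class OffloadedConnection:
    """Toy model of the per-connection state an RNIC might hold."""
    peer_ip: str
    next_hop_ip: str
    path_mtu: int
    established: bool = True


def apply_route_change(conn: OffloadedConnection,
                       new_next_hop_ip: str,
                       new_path_mtu: int) -> bool:
    """Apply a route change without tearing down the connection.

    Returns True if the path MTU had to be lowered, which is the one
    case where the change is not invisible to the connection.
    """
    conn.next_hop_ip = new_next_hop_ip
    if new_path_mtu < conn.path_mtu:
        conn.path_mtu = new_path_mtu
        return True
    return False


conn = OffloadedConnection("198.51.100.7", "192.0.2.1", 1500)

# Same path MTU over the new route: only the next hop changes.
lowered_first = apply_route_change(conn, "192.0.2.2", 1500)

# Smaller path MTU over the new route: the connection must adapt.
lowered_second = apply_route_change(conn, "192.0.2.3", 1400)
```

In both cases the connection stays established; nothing resembling an
IB APM path migration is involved.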

> 
>>> PathMTU
>>> 
>>> The new route to the remote peer has a hop with a smaller MTU than
>>> we're currently using. Ouch! All my packets are going to be dropped
>>> until I reduce my path MTU. The provider can't know unless he is
>>> either filtering all ICMP traffic himself ("evil") or is notified
>>> via an event ("nice"). 
>>> 
>>> So all this said, my little brain had imagined this logic going in
>>> and around the ib_at module in a wonderfully crafted bit of
>>> algorithmic art -- once I figured out how to do it all ;-)
>>> 
>>> It sounds like you're beating the same bushes. How would you like to
>>> proceed?
>> 
>> I'd like to define a set of changes to ib_addr and the rdma_cm that
>> makes it easier to support multiple RDMA devices, then evolve the
>> codebase from there.
>> My hope is to keep the network addressing ugliness in ib_addr.
>> 
>> The changes to the ib_addr interface are based on trying to determine
>> what might help support iWarp after looking at your patch.  If the
>> changes appear to be a step in the right direction, then I will
>> commit them.  The essence of the change is that ib_addr leaves the
>> interpretation of the addresses up to the caller, which may still be
>> a good thing even if it doesn't directly make supporting iWarp any
>> easier. 
> 
> My 2 cents is that it's a good thing. Sorry to throw 10 lbs
> of @#^$ in with this bag... I was core dumping.
> 

I agree with Tom's assessment here. Leaving the interpretation of
the rdma_addr up to the rdma transport device is a necessary
and solid first step, but it should be understood that there
are some related issues that will also have to be addressed.
A major part of the semantics of an IP address will have to
include that it has consistent meaning whether you are working
through the RDMA interface or the L2/network device interface.

I understand that it is a bit trickier with IPoIB, but the
principle still stands that there should be some correlation
between how packets are handled for an RDMA connection and a
SOCK_STREAM connection that are both established to the same remote
IP address.

From the iWARP side we can discuss what hooks we need
to properly integrate iWARP with the L2 device to avoid these
problems. As these proposals are brought up, we should have
some review to double-check that we have properly minimized
the impact on netdev and so that the hooks can be defined
in a way that will allow native IB, SDP/IB and IPoIB to
have similar consistency guarantees.