[openib-general] [PATCH] IPOIB: Use a GRH when appropriate for unicast packets

Wed Feb 7 15:33:57 PST 2007

On Wed, Feb 07, 2007 at 02:31:10PM -0800, Sean Hefty wrote:

> >I agree that special casing some IPv6 addresses is a bad idea. It
> >needs to be integrated correctly with NET and the routing table/etc

> I haven't given this more than a few minutes of thought, but I was thinking 
> more along the lines of a port having an assigned GID that's the same as an
> assigned IPv6 address.  (Is there some reason this wouldn't work?)  IP name 
> service resolution would map the name to the IPv6 address.  The mapping 
> from the IPv6 address to a GID would then be straightforward, as opposed to 
> using a mapping using ARP.

Right, I also like the idea of using DNS as a global GID name
service.
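
To illustrate why the address-to-GID step is straightforward: both are 128-bit values, so the mapping can simply be the identity on the 16 bytes. A minimal sketch in Python, where the function names are my own and not part of any existing API:

```python
import ipaddress


def ipv6_to_gid(addr: str) -> bytes:
    """Return the 16-byte GID that is bit-for-bit equal to the IPv6 address."""
    return ipaddress.IPv6Address(addr).packed


def gid_to_ipv6(gid: bytes) -> str:
    """Return the textual IPv6 address for a 16-byte GID."""
    if len(gid) != 16:
        raise ValueError("a GID is exactly 16 bytes")
    return str(ipaddress.IPv6Address(gid))


def split_gid(gid: bytes) -> tuple:
    """Split a GID into its 64-bit subnet prefix and 64-bit interface part."""
    if len(gid) != 16:
        raise ValueError("a GID is exactly 16 bytes")
    return gid[:8], gid[8:]
```

With this, a name resolved through DNS to 2000::c2 gives the GID directly, with no ARP step involved.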

> If name service resolution gives me an IPv6 address that's off of the local
> subnet, but the ARP response gives me an address that's on the local 
> subnet, then I think we can assume that ARP was unsuccessful in resolving 
> the address to the remote GID.  (I.e. the GID should be for a router.)  If 
> this is true, then we need some other way to acquire the DGID.

This is where I think you have problems... Why would you ARP for an
off-subnet address? Why would the router answer?  You push the address
through the route table and ARP the router address that results.

All of that is why I think another netdevice is a tidy
solution. ping6/tcp/etc using this device would generate packets that
follow the same path as RDMA connections would. No special rules about
broadcast groups are required. The route table is used to instruct the
kernel what IPv6 prefixes are IB GIDs and which are not by associating
the output of the route with the ib0 device. The admins can use any
means to set that up. Something that looks like:

$ ip addr
1: ib0: <BROADCAST,MULTICAST,UP,10000> mtu 2048 qdisc pfifo_fast qlen 1000
    link/ib [my GID..]
    inet6 fe80::c2/64 scope link dynamic <<-- My LL GID
    inet6 2000::c2/64 scope global dynamic  <<-- My GID

Both are maintained by the kernel.

$ ip -6 route
fe80::/64 dev ib0
2000::/64 dev ib0 src 2000::c2
2001::/64 dev ib0 src 2000::c2  <<-- Tells the kernel that 2001::/64
                                     is a GID and to use path records
                                     to do lookups at the SM
2002::/64 via fe80::a0 dev ib0 src 2000::c2 <<-- 2002::/64 is a GID
                                                 but don't query the SM and
                                                 direct things to IB
                                                 router fe80::a0
$ ping6 -I ib0 2001::b1
 ^--- Generate packet structured as: LRH,GRH,ICMP6,PING_DATA
      Set the GRH.SGID to 2000::c2, DGID to 2001::b1 as per the route
      table
      Do a SM Path Record query for 2001::b1 and use that to set the LRH
$ ping6 -I ib0 2002::b1
 ^--- Generate packet structured as: LRH,GRH,ICMP6,PING_DATA
      Set the GRH.SGID to 2000::c2, DGID to 2002::b1 as per the route
      table
      Do a SM Path Record query for fe80::a0 and use that to set the
      LRH
$ traceroute6 -I ib0 2001::b1
 ^--- Same as the ping, except the IB router can capture the packet when
      the hop limit runs out and produce an ICMP error.

Note: In all three cases the LRH.LNH would be set to 1 (non-IBA raw
IPv6). RDMA CM would use the usual value of 3.
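
The lookup rules in the examples above can be sketched roughly as follows. This is only an illustration in Python, not kernel code: the Route class, the ROUTES table and resolve() are hypothetical names, and the table just mirrors the example 'ip -6 route' output. The point is that the DGID is always the final destination, while the address handed to the SM path record query is either the destination (on-link route) or the IB router (route with a 'via').

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    prefix: ipaddress.IPv6Network
    src: Optional[ipaddress.IPv6Address]  # preferred source GID, if any
    via: Optional[ipaddress.IPv6Address]  # IB router GID; None means on-link


# Hypothetical table mirroring the example route output above.
ROUTES = [
    Route(ipaddress.IPv6Network("fe80::/64"), None, None),
    Route(ipaddress.IPv6Network("2000::/64"),
          ipaddress.IPv6Address("2000::c2"), None),
    Route(ipaddress.IPv6Network("2001::/64"),
          ipaddress.IPv6Address("2000::c2"), None),
    Route(ipaddress.IPv6Network("2002::/64"),
          ipaddress.IPv6Address("2000::c2"),
          ipaddress.IPv6Address("fe80::a0")),
]


def resolve(dst: str, routes=ROUTES) -> dict:
    """Pick the longest matching prefix and derive the GRH/LRH inputs."""
    addr = ipaddress.IPv6Address(dst)
    matches = [r for r in routes if addr in r.prefix]
    if not matches:
        raise LookupError(f"no route to {dst}")
    route = max(matches, key=lambda r: r.prefix.prefixlen)
    return {
        "sgid": route.src,   # goes into GRH.SGID
        "dgid": addr,        # goes into GRH.DGID, always the final target
        # The GID to ask the SM about in order to fill in the LRH:
        "path_record_target": route.via if route.via is not None else addr,
    }
```

Calling resolve("2001::b1") asks the SM about 2001::b1 itself, while resolve("2002::b1") keeps 2002::b1 as the DGID but asks the SM about fe80::a0.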

This also provides at least a mechanism, if not a full solution, to
the MTU problem. Linux already allows route entries to specify a MTU
and with closer integration of the raw IPv6 stuff it becomes possible
for routers to send ICMP6 errors as raw IPv6 packets and for Linux to
capture them and update the route. The ICMP6 errors are crucial to
having path MTU type functions converge quickly.
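
As a rough sketch of the convergence I mean, here is the usual path MTU bookkeeping in Python. PathMtuCache and its methods are hypothetical names for illustration only; the real state would live in the kernel's route entries. It assumes the standard IPv6 behaviour: a Packet Too Big error can only lower the cached value, and never below the IPv6 minimum of 1280.

```python
IPV6_MIN_MTU = 1280  # minimum link MTU every IPv6 path must support


class PathMtuCache:
    """Per-destination path MTU, lowered by ICMP6 Packet Too Big errors."""

    def __init__(self, link_mtu: int):
        self.link_mtu = link_mtu
        self._mtu = {}  # destination address string -> learned path MTU

    def lookup(self, dst: str) -> int:
        """Return the current path MTU, defaulting to the link MTU."""
        return self._mtu.get(dst, self.link_mtu)

    def on_packet_too_big(self, dst: str, reported_mtu: int) -> int:
        """Apply a Packet Too Big report; the estimate never increases."""
        current = self.lookup(dst)
        new_mtu = max(IPV6_MIN_MTU, min(current, reported_mtu))
        self._mtu[dst] = new_mtu
        return new_mtu
```

One error from the router is enough to settle on the right value for that destination, which is why getting the ICMP6 errors back matters so much.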

RDMA CM would use the same rules for addressing CM packets.

A further refinement would be to layer the entire path record query
mechanism in the kernel over this so that the admin has local control
over the IB routing table (if desired). A 2nd refinement would be to
use the ND cache of such an ib0 device as a local path record query
cache (again lets the admin see what is going on and override/discard
SA queries using the usual 'ip neigh' command). There might even be
good potential for sa replication using the already existing userspace
arpd stuff.

Overall I would just view something like this as further integrating
the IB stack with the existing rich services provided by NET rather
than trying to duplicate a small portion of them with separate
interfaces. [For instance with something like this netlink could be
used instead of the sysfs probing for many cases]

But yes, it is a bit outside what the current framework envisions..

Jason