[ofa-general] Re: pick the outgoing HCA based on the IP used for bind

Or Gerlitz ogerlitz at voltaire.com
Thu Feb 5 03:44:53 PST 2009


Hi Sean,

It seems that even when the rdma-cm consumer binds to a specific address,
the rdma-cm address resolution code follows the order of the devices/rules
in the routing table. So the user can't really dictate an outgoing interface
based on the src address provided to rdma_resolve_addr. This problem seems to
happen even if the user first called rdma_bind_addr, so it's either the same
issue or rdma_resolve_addr is somehow stepping on the device/port
"resolved" by rdma_bind_addr.

Consider this system, with two IPoIB interfaces on the same IP subnet using
the same HCA, each on a different port. The first match for 192.168.10.0/24
would be ib3. Now I issue a ping with the -I flag, to have the ICMP socket
bind to a different interface. First, I see that two neighbours have been
created, each on a different interface, and second, from sampling the interface
packet counters (not shown here) I see that each ping uses the correct interface.
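
For comparison, here is a minimal sketch of what a plain-socket consumer does
when it pins its source address, which is roughly what ping -I does when it is
given an address rather than an interface name. The helper name connect_from
and the use of a UDP socket are assumptions made for illustration. The sketch
only shows the bind-before-connect pattern and reads back the local address the
kernel chose; which port the traffic actually leaves on still has to be checked
with the interface packet counters.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: open a UDP socket whose source address is pinned
 * to src_ip, connect it to dst_ip:port, and report the local address the
 * kernel actually selected. Returns the socket fd, or -1 on failure.
 */
static int connect_from(const char *src_ip, const char *dst_ip,
                        unsigned short port, struct sockaddr_in *local)
{
    struct sockaddr_in src, dst;
    socklen_t len = sizeof(*local);
    int fd;

    memset(&src, 0, sizeof(src));
    src.sin_family = AF_INET;
    src.sin_port = 0;                     /* any local port */
    if (inet_pton(AF_INET, src_ip, &src.sin_addr) != 1)
        return -1;

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(port);
    if (inet_pton(AF_INET, dst_ip, &dst.sin_addr) != 1)
        return -1;

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    /* Bind first so the route lookup at connect() sees the source address. */
    if (bind(fd, (struct sockaddr *)&src, sizeof(src)) < 0 ||
        connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0 ||
        getsockname(fd, (struct sockaddr *)local, &len) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

On the system described here one would call it once with 192.168.10.60 and once
with 192.168.10.61 as src_ip, both towards 192.168.10.89, and compare the
ib2/ib3 counters before and after sending.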

Repeating the same test with rds-ping -I (rds-ping is a user space utility provided
by the rds-devel package, sending packets through the rds kernel driver) - I can see
that the two rds rdma-cm ids (rds would have two connections in that case) are using
the same port, the one corresponding to ib3, the first routing match.
Below is some info on my system.

Or, when running with multiple HCAs on Linux - we run into a problem with RDS - in that
rdma_resolve_addr does not pick the outgoing NIC based on the IP we bind to. It seems
to always be using the destination IP.

We put this patch together - which solves the problem on Linux... note that this
behavior only fails on Linux - it works correctly on HPUX, as an example.

Do you see a problem with proposing that this patch be picked up by OFED ?

Rick Frank, who brought this to my attention, also handed me this patch,
which is claimed to work around this issue. It's badly formatted and I
couldn't really understand what it does. I hoped to be able to reproduce
this with rping or ucmatose, but neither allows me to specify a -I address
on the client side, and I don't have the time now for this enhancement.

--- ofa_kernel-1.3.1.orig/drivers/infiniband/core/addr.c
+++ ofa_kernel-1.3.1/drivers/infiniband/core/addr.c
@@ -174,15 +174,29 @@ static int addr_resolve_remote(struct so
 	struct flowi fl;
 	struct rtable *rt;
 	struct neighbour *neigh;
+	struct net_device *dev;
 	int ret;

 	memset(&fl, 0, sizeof fl);
 	fl.nl_u.ip4_u.daddr = dst_ip;
 	fl.nl_u.ip4_u.saddr = src_ip;
+
+	if (src_ip && (dev = ip_dev_find(src_ip)) != NULL) {
+		fl.oif = dev->ifindex;
+		dev_put(dev);
+
+		ret = ip_route_output_key(&rt, &fl);
+		if (ret == 0)
+			goto found;
+		/* Fall back to using any local device */
+		fl.oif = 0;
+	}
 	ret = ip_route_output_key(&rt, &fl);
 	if (ret)
 		goto out;

+found: ;
+
 	/* If the device does ARP internally, return 'done' */
 	if (rt->idev->dev->flags & IFF_NOARP) {
 		rdma_copy_addr(addr, rt->idev->dev, NULL);
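
As far as I can read it, the patch first restricts the route lookup to the
device that owns the source address and only falls back to an unrestricted
lookup if that fails. The following user-space toy model is a sketch of that
logic, not kernel code: the toy_* names, the two tables, and the resolve_*
functions are all hypothetical stand-ins, filled in from the route -n and
ip addr output of this system (ib2 is ifindex 11, ib3 is ifindex 12).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for the kernel structures; all names here are hypothetical. */
struct toy_route { uint32_t net, mask; int ifindex; };
struct toy_iface { uint32_t addr; int ifindex; };

/* Same order as "route -n" on this system: ib3 is listed first. */
static const struct toy_route routes[] = {
    { 0xC0A80A00u, 0xFFFFFF00u, 12 },   /* 192.168.10.0/24 dev ib3 */
    { 0xC0A80A00u, 0xFFFFFF00u, 11 },   /* 192.168.10.0/24 dev ib2 */
};
static const struct toy_iface ifaces[] = {
    { 0xC0A80A3Cu, 11 },                /* 192.168.10.60 on ib2 */
    { 0xC0A80A3Du, 12 },                /* 192.168.10.61 on ib3 */
};

/* Plays the role of ip_dev_find(): local address -> ifindex, 0 if none. */
static int toy_dev_find(uint32_t src)
{
    for (size_t i = 0; i < sizeof(ifaces) / sizeof(ifaces[0]); i++)
        if (ifaces[i].addr == src)
            return ifaces[i].ifindex;
    return 0;
}

/* Plays the role of ip_route_output_key(): first match, oif == 0 means any. */
static int toy_route_lookup(uint32_t dst, int oif)
{
    for (size_t i = 0; i < sizeof(routes) / sizeof(routes[0]); i++)
        if ((dst & routes[i].mask) == routes[i].net &&
            (oif == 0 || routes[i].ifindex == oif))
            return routes[i].ifindex;
    return -1;                          /* no route */
}

/* Unpatched behaviour: the source address does not constrain the device. */
static int resolve_unpatched(uint32_t src, uint32_t dst)
{
    (void)src;
    return toy_route_lookup(dst, 0);
}

/* Patched behaviour: try the device owning src first, then fall back. */
static int resolve_patched(uint32_t src, uint32_t dst)
{
    int oif = src ? toy_dev_find(src) : 0;

    if (oif) {
        int out = toy_route_lookup(dst, oif);
        if (out >= 0)
            return out;
    }
    return toy_route_lookup(dst, 0);
}
```

In this model, resolving 192.168.10.89 from either source address lands on ib3
without the patch, while the patched variant picks ib2 for 192.168.10.60 and
ib3 for 192.168.10.61, which is the behaviour the rds-ping test is looking for.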




[root@anise ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.10.0    0.0.0.0         255.255.255.0   U     0      0        0 ib3
192.168.10.0    0.0.0.0         255.255.255.0   U     0      0        0 ib2


[root@anise ~]# ip addr show ib2
11: ib2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast qlen 256
    link/infiniband 80:56:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:03:17:c1 brd
    inet 192.168.10.60/24 brd 192.168.10.255 scope global ib2
    inet6 fe80::202:c903:3:17c1/64 scope link

[root@anise ~]# ip addr show ib3
12: ib3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast qlen 256
    link/infiniband 80:56:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:03:17:c2 brd
    inet 192.168.10.61/24 brd 192.168.10.255 scope global ib3
    inet6 fe80::202:c903:3:17c2/64 scope link

[root@anise ~]# ping -I 192.168.10.60 192.168.10.89
2 packets transmitted, 2 received, 0% packet loss, time 999ms

[root@anise ~]# ping -I 192.168.10.61 192.168.10.89
3 packets transmitted, 3 received, 0% packet loss, time 1999ms

[root@anise ~]# ip n s
192.168.10.89 dev ib3 lladdr 80:f4:04:04:fe:80:00:00:00:00:00:00:00:02:c9:02:00:22:ef:e5 STALE
192.168.10.89 dev ib2 lladdr 80:f4:04:04:fe:80:00:00:00:00:00:00:00:02:c9:02:00:22:ef:e5 STALE

[root@anise ~]# rds-ping -I 192.168.10.60 192.168.10.89
   3: 33 usec

[root@anise ~]# rds-ping -I 192.168.10.61 192.168.10.89
   3: 33 usec

[root@anise ~]# rds-info -I
RDS IB Connections:
      LocalAddr      RemoteAddr                         LocalDev                        RemoteDev
  192.168.10.61   192.168.10.89              fe80::2:c903:3:17c2             fe80::2:c902:22:efe5
  192.168.10.60   192.168.10.89              fe80::2:c903:3:17c2             fe80::2:c902:22:efe5


