[ofa-general] Re: pick the outgoing HCA based on the IP used for bind

Richard Frank richard.frank at oracle.com
Thu Feb 5 07:23:53 PST 2009


FWIW - I tested with this patch to rdma_resolve_ip and found no difference
in behavior.

At this point I do not think the addr.c patch resolves this. At one point
we had two overlapping patches, both possibly solving the same problem.
Now that rds is explicitly binding to an IP, the resolve_ip patch appears
to be unnecessary.

The original problem is that we were not getting to either the HCA or the
port associated with an IP, even in a dual HCA configuration. Now that rds
is explicitly binding we do get the correct HCA (based on Or's tests).
However, we really want to resolve down to the port backing the IP.
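
For anyone trying to follow what the addr.c patch quoted below attempts,
here is a rough user-space model of its lookup order: first try a route
restricted to the interface that owns the source IP, then fall back to
an unrestricted lookup. This is only a sketch. The names model_route,
model_route_lookup, model_resolve and MODEL_IP are made up for
illustration and are not kernel APIs, and the first-match table scan is
a simplification of what ip_route_output_key really does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build a host-order IPv4 address from four octets (illustrative helper). */
#define MODEL_IP(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
     ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Hypothetical, simplified route entry; not a kernel structure. */
struct model_route {
    uint32_t network;  /* destination network, host byte order */
    uint32_t netmask;  /* netmask, host byte order */
    int      ifindex;  /* outgoing interface index */
};

/*
 * Return the first table entry matching dst. If oif is non-zero, only
 * entries on that interface are considered (assumed analogue of fl.oif).
 */
const struct model_route *
model_route_lookup(const struct model_route *table, size_t n,
                   uint32_t dst, int oif)
{
    assert(table != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if ((dst & table[i].netmask) != table[i].network)
            continue;
        if (oif != 0 && table[i].ifindex != oif)
            continue;
        return &table[i];
    }
    return NULL;
}

/*
 * Mirror the patch's order: if the source IP maps to an interface
 * (src_ifindex != 0), try a lookup pinned to it; otherwise, or if that
 * fails, fall back to a lookup on any interface.
 */
const struct model_route *
model_resolve(const struct model_route *table, size_t n,
              uint32_t dst, int src_ifindex)
{
    if (src_ifindex != 0) {
        const struct model_route *rt =
            model_route_lookup(table, n, dst, src_ifindex);
        if (rt != NULL)
            return rt;
    }
    return model_route_lookup(table, n, dst, 0);
}
```

With Or's table (ib3 listed first with ifindex 12, ib2 with ifindex 11),
the model picks ib3 when no source interface is given and ib2 when the
lookup is pinned to ifindex 11. It says nothing about which HCA port the
rdma-cm id ends up on, which is the part we still need.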

----- Original Message ----- 
From: "Or Gerlitz" <ogerlitz at voltaire.com>
To: "Sean Hefty" <sean.hefty at intel.com>
Cc: <general at lists.openfabrics.org>; <rds-devel at oss.oracle.com>;
"Richard Frank" <richard.frank at oracle.com>
Sent: Thursday, February 05, 2009 6:44 AM
Subject: Re: pick the outgoing HCA based on the IP used for bind


> Hi Sean,
>
> It seems that even when the rdma-cm consumer binds to a specific address,
> the rdma-cm address resolution code follows the order of the devices/rules
> in the routing table. So the user can't really dictate an outgoing
> interface based on the src address provided to rdma_resolve_addr. This
> problem seems to happen even if the user first called rdma_bind_addr, so
> it's either the same issue or rdma_resolve_addr is somehow stepping on the
> device/port "resolved" by rdma_bind_addr.
>
> Consider this system, with two IPoIB interfaces on the same IP subnet using
> the same HCA, each on a different port. The first match for
> 192.168.10.0/24 would be ib3. Now I issue a ping with the -I flag, to have
> the ICMP socket bind to a different interface. First, I see that two
> neighbours have been created, each on a different interface, and second,
> from sampling the interface packet counters (not shown here), I see that
> each ping uses the correct interface.
>
> Repeating the same test with rds-ping -I (rds-ping is a user space utility
> provided by the rds-devel package, sending packets through the rds kernel
> driver), I can see that the two rds rdma-cm ids (rds would have two
> connections in that case) are using the same port, the one corresponding
> to ib3, the first routing match. Below is some info on my system.
>
> Or, when running with multiple HCAs on Linux, we run into a problem with
> RDS: rdma_resolve_addr does not pick the outgoing NIC based on the IP we
> bind to. It seems to always be using the destination IP.
>
> We put this patch together, which solves the problem on Linux. Note that
> this behavior only fails on Linux; it works correctly on HPUX, as an
> example.
>
> Do you see a problem with proposing that this patch be picked up by OFED?
>
> Rick Frank, who brought this to my attention, also handed me this patch,
> which is claimed to work around this issue. It's badly formatted and I
> couldn't really understand what it does. I hoped to be able to reproduce
> this with rping or ucmatose, but neither allows me to specify a -I address
> on the client side, and I don't have the time now for this enhancement.
>
> --- ofa_kernel-1.3.1.orig/drivers/infiniband/core/addr.c
> +++ ofa_kernel-1.3.1/drivers/infiniband/core/addr.c
> @@ -174,15 +174,29 @@ static int addr_resolve_remote(struct so
>          struct flowi fl;
>          struct rtable *rt;
>          struct neighbour *neigh;
> +        struct net_device *dev;
>          int ret;
>
>          memset(&fl, 0, sizeof fl);
>          fl.nl_u.ip4_u.daddr = dst_ip;
>          fl.nl_u.ip4_u.saddr = src_ip;
> +
> +        if (src_ip && (dev = ip_dev_find(src_ip)) != NULL) {
> +                fl.oif = dev->ifindex;
> +                dev_put(dev);
> +
> +                ret = ip_route_output_key(&rt, &fl);
> +                if (ret == 0)
> +                        goto found;
> +                /* Fall back to using any local device */
> +                fl.oif = 0;
> +        }
>          ret = ip_route_output_key(&rt, &fl);
>          if (ret)
>                  goto out;
>
> +found: ;
> +
>          /* If the device does ARP internally, return 'done' */
>          if (rt->idev->dev->flags & IFF_NOARP) {
>                  rdma_copy_addr(addr, rt->idev->dev, NULL);
>
>
>
>
> [root at anise ~]# route -n
> Kernel IP routing table
> Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
> 192.168.10.0    0.0.0.0         255.255.255.0   U     0      0        0 ib3
> 192.168.10.0    0.0.0.0         255.255.255.0   U     0      0        0 ib2
>
>
> [root at anise ~]# ip addr show ib2
> 11: ib2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast qlen 256
>    link/infiniband 80:56:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:03:17:c1 brd
>    inet 192.168.10.60/24 brd 192.168.10.255 scope global ib2
>    inet6 fe80::202:c903:3:17c1/64 scope link
>
> [root at anise ~]# ip addr show ib3
> 12: ib3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast qlen 256
>    link/infiniband 80:56:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:03:17:c2 brd
>    inet 192.168.10.61/24 brd 192.168.10.255 scope global ib3
>    inet6 fe80::202:c903:3:17c2/64 scope link
>
> [root at anise ~]# ping -I 192.168.10.60 192.168.10.89
> 2 packets transmitted, 2 received, 0% packet loss, time 999ms
>
> [root at anise ~]# ping -I 192.168.10.61 192.168.10.89
> 3 packets transmitted, 3 received, 0% packet loss, time 1999ms
>
> [root at anise ~]# ip n s
> 192.168.10.89 dev ib3 lladdr 80:f4:04:04:fe:80:00:00:00:00:00:00:00:02:c9:02:00:22:ef:e5 STALE
> 192.168.10.89 dev ib2 lladdr 80:f4:04:04:fe:80:00:00:00:00:00:00:00:02:c9:02:00:22:ef:e5 STALE
>
> [root at anise ~]# rds-ping -I 192.168.10.60 192.168.10.89
>   3: 33 usec
>
> [root at anise ~]# rds-ping -I 192.168.10.61 192.168.10.89
>   3: 33 usec
>
> [root at anise ~]# rds-info -I
> RDS IB Connections:
>      LocalAddr      RemoteAddr                         LocalDev            RemoteDev
>  192.168.10.61   192.168.10.89              fe80::2:c903:3:17c2 fe80::2:c902:22:efe5
>  192.168.10.60   192.168.10.89              fe80::2:c903:3:17c2 fe80::2:c902:22:efe5
> 



