[ofa-general] Re: pick the outgoing HCA based on the IP used for bind
Richard Frank
richard.frank at oracle.com
Thu Feb 5 07:23:53 PST 2009
FWIW - I tested with this patch to rdma_resolve_ip and found no difference
in behavior.
At this point I do not think the addr.c patch resolves this. At one point
we had two overlapping patches, both possibly solving the same problem.
Now that rds is explicitly binding to an IP, the resolve_ip patch appears
to be unnecessary.
The original problem was that we were not getting to either the HCA or the
port associated with an IP, even in a dual-HCA configuration. Now that rds
is explicitly binding, we do get the correct HCA (based on Or's tests);
however, we really want to resolve down to the port backing the IP.
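
For readers following the thread, here is a minimal user-space sketch of the lookup order the quoted addr.c patch appears to aim for: first try a route through the device that owns the bind address, then fall back to whichever route matches first. The types model_if and model_route, the helper names, and the simple first-match table walk are all assumptions made for illustration. This is not the kernel's routing code, and it models only device selection, not neighbour resolution.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Build a host-order IPv4 address from dotted-quad components. */
#define IP4(a, b, c, d) \
	(((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
	 ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Hypothetical model types; they do not exist in the kernel. */
struct model_if {
	const char *name;
	uint32_t addr;		/* address configured on the interface */
};

struct model_route {
	uint32_t dest;		/* network address */
	uint32_t mask;		/* netmask */
	const char *ifname;	/* outgoing interface */
};

/* Interfaces and routes as listed in the quoted message (ib3 matches first). */
static const struct model_if example_ifs[] = {
	{ "ib2", IP4(192, 168, 10, 60) },
	{ "ib3", IP4(192, 168, 10, 61) },
};

static const struct model_route example_routes[] = {
	{ IP4(192, 168, 10, 0), IP4(255, 255, 255, 0), "ib3" },
	{ IP4(192, 168, 10, 0), IP4(255, 255, 255, 0), "ib2" },
};

/* Rough analogue of ip_dev_find(): the interface owning src, or NULL. */
static const struct model_if *find_if_by_addr(const struct model_if *ifs,
					      size_t n_ifs, uint32_t src)
{
	for (size_t i = 0; i < n_ifs; i++)
		if (ifs[i].addr == src)
			return &ifs[i];
	return NULL;
}

/* First route matching dst; if oif is non-NULL, only routes via oif count. */
static const struct model_route *route_lookup(const struct model_route *routes,
					       size_t n_routes, uint32_t dst,
					       const char *oif)
{
	for (size_t i = 0; i < n_routes; i++) {
		if ((dst & routes[i].mask) != routes[i].dest)
			continue;
		if (oif == NULL || strcmp(routes[i].ifname, oif) == 0)
			return &routes[i];
	}
	return NULL;
}

/* Prefer the device that owns src; otherwise fall back to any device. */
static const char *pick_outgoing_if(const struct model_if *ifs, size_t n_ifs,
				    const struct model_route *routes,
				    size_t n_routes, uint32_t src, uint32_t dst)
{
	const struct model_route *rt;

	if (src != 0) {
		const struct model_if *dev = find_if_by_addr(ifs, n_ifs, src);

		if (dev != NULL) {
			rt = route_lookup(routes, n_routes, dst, dev->name);
			if (rt != NULL)
				return rt->ifname;
		}
	}
	/* No usable source device: plain destination-based lookup. */
	rt = route_lookup(routes, n_routes, dst, NULL);
	return rt != NULL ? rt->ifname : NULL;
}
```

With the example tables, a source of 192.168.10.60 selects ib2, while an unspecified source (0) falls through to ib3, the first routing match. That is the behavior difference the thread is about.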
----- Original Message -----
From: "Or Gerlitz" <ogerlitz at voltaire.com>
To: "Sean Hefty" <sean.hefty at intel.com>
Cc: <general at lists.openfabrics.org>; <rds-devel at oss.oracle.com>; "Richard
Frank" <richard.frank at oracle.com>
Sent: Thursday, February 05, 2009 6:44 AM
Subject: Re: pick the outgoing HCA based on the IP used for bind
> Hi Sean,
>
> It seems that even when the rdma-cm consumer binds to a specific address,
> the rdma-cm address resolution code follows the order of the devices/rules
> in the routing table, so the user can't really dictate an outgoing
> interface based on the src address provided to rdma_resolve_addr. This
> problem seems to happen even if the user first called rdma_bind_addr, so
> it's either the same issue or rdma_resolve_addr is somehow stepping on the
> device/port "resolved" by rdma_bind_addr.
>
> Consider this system, with two IPoIB interfaces on the same IP subnet using
> the same HCA, each on a different port. The first match for 192.168.10.0/24
> would be ib3. Now I issue a ping with the -I flag, to have the ICMP socket
> bind to a different interface. First, I see that two neighbours have been
> created, each on a different interface, and second, from sampling the
> interface packet counters (not shown here), I see that each ping uses the
> correct interface.
>
> Repeating the same test with rds-ping -I (rds-ping is a user space utility
> provided by the rds-devel package, sending packets through the rds kernel
> driver), I can see that the two rds rdma-cm ids (rds would have two
> connections in that case) are using the same port, the one corresponding
> to ib3, the first routing match. Below is some info on my system.
>
> Or, when running with multiple HCAs on Linux, we run into a problem with
> RDS in that rdma_resolve_addr does not pick the outgoing NIC based on the
> IP we bind to; it seems to always be using the destination IP.
>
> We put this patch together, which solves the problem on Linux. Note that
> this behavior only fails on Linux; it works correctly on HPUX, as an
> example.
>
> Do you see a problem with proposing that this patch be picked up by OFED ?
>
> Rick Frank, who brought this to my attention, also handed me this patch,
> which is claimed to work around this issue. It's badly formatted and I
> couldn't really understand what it does. I hoped to be able to reproduce
> this with rping or ucmatose, but neither allows me to specify a -I address
> on the client side, and I don't have the time now for this enhancement.
>
> --- ofa_kernel-1.3.1.orig/drivers/infiniband/core/addr.c
> +++ ofa_kernel-1.3.1/drivers/infiniband/core/addr.c
> @@ -174,15 +174,29 @@ static int addr_resolve_remote(struct so
>  	struct flowi fl;
>  	struct rtable *rt;
>  	struct neighbour *neigh;
> +	struct net_device *dev;
>  	int ret;
>
>  	memset(&fl, 0, sizeof fl);
>  	fl.nl_u.ip4_u.daddr = dst_ip;
>  	fl.nl_u.ip4_u.saddr = src_ip;
> +
> +	if (src_ip && (dev = ip_dev_find(src_ip)) != NULL) {
> +		fl.oif = dev->ifindex;
> +		dev_put(dev);
> +
> +		ret = ip_route_output_key(&rt, &fl);
> +		if (ret == 0)
> +			goto found;
> +		/* Fall back to using any local device */
> +		fl.oif = 0;
> +	}
>  	ret = ip_route_output_key(&rt, &fl);
>  	if (ret)
>  		goto out;
>
> +found: ;
> +
>  	/* If the device does ARP internally, return 'done' */
>  	if (rt->idev->dev->flags & IFF_NOARP) {
>  		rdma_copy_addr(addr, rt->idev->dev, NULL);
>
>
>
>
> [root at anise ~]# route -n
> Kernel IP routing table
> Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
> 192.168.10.0    0.0.0.0         255.255.255.0   U     0      0        0 ib3
> 192.168.10.0    0.0.0.0         255.255.255.0   U     0      0        0 ib2
>
>
> [root at anise ~]# ip addr show ib2
> 11: ib2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast qlen 256
>     link/infiniband 80:56:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:03:17:c1 brd
>     inet 192.168.10.60/24 brd 192.168.10.255 scope global ib2
>     inet6 fe80::202:c903:3:17c1/64 scope link
>
> [root at anise ~]# ip addr show ib3
> 12: ib3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast qlen 256
>     link/infiniband 80:56:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:03:17:c2 brd
>     inet 192.168.10.61/24 brd 192.168.10.255 scope global ib3
>     inet6 fe80::202:c903:3:17c2/64 scope link
>
> [root at anise ~]# ping -I 192.168.10.60 192.168.10.89
> 2 packets transmitted, 2 received, 0% packet loss, time 999ms
>
> [root at anise ~]# ping -I 192.168.10.61 192.168.10.89
> 3 packets transmitted, 3 received, 0% packet loss, time 1999ms
>
> [root at anise ~]# ip n s
> 192.168.10.89 dev ib3 lladdr 80:f4:04:04:fe:80:00:00:00:00:00:00:00:02:c9:02:00:22:ef:e5 STALE
> 192.168.10.89 dev ib2 lladdr 80:f4:04:04:fe:80:00:00:00:00:00:00:00:02:c9:02:00:22:ef:e5 STALE
>
> [root at anise ~]# rds-ping -I 192.168.10.60 192.168.10.89
> 3: 33 usec
>
> [root at anise ~]# rds-ping -I 192.168.10.61 192.168.10.89
> 3: 33 usec
>
> [root at anise ~]# rds-info -I
> RDS IB Connections:
> LocalAddr       RemoteAddr      LocalDev              RemoteDev
> 192.168.10.61   192.168.10.89   fe80::2:c903:3:17c2   fe80::2:c902:22:efe5
> 192.168.10.60   192.168.10.89   fe80::2:c903:3:17c2   fe80::2:c902:22:efe5
>
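
One way to double-check the rds-info output quoted above: in an IPoIB link-layer address, the last 16 of the 20 bytes are the port GID, so the LocalDev column can be compared against the link/infiniband line of each interface. The sketch below does that comparison in C, assuming that LocalDev is the local port GID and that POSIX inet_pton is available. The function names are made up for illustration and are not part of any RDS or rdma-cm tool.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

#define IPOIB_HWADDR_LEN 20	/* 4 bytes flags/QPN + 16 bytes port GID */

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_val(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/* Parse 20 colon-separated hex bytes ("80:56:...:c1"). Returns 0 on success. */
static int parse_ipoib_lladdr(const char *s, unsigned char out[IPOIB_HWADDR_LEN])
{
	for (int i = 0; i < IPOIB_HWADDR_LEN; i++) {
		int hi = hex_val(s[0]);
		int lo = hi < 0 ? -1 : hex_val(s[1]);

		if (lo < 0)
			return -1;
		out[i] = (unsigned char)((hi << 4) | lo);
		s += 2;
		if (i < IPOIB_HWADDR_LEN - 1) {
			if (*s != ':')
				return -1;
			s++;
		}
	}
	return *s == '\0' ? 0 : -1;
}

/*
 * Returns 1 if gid_text (IPv6 notation) equals the port GID embedded in
 * lladdr_text, 0 if it differs, and -1 if either string fails to parse.
 */
static int lladdr_matches_gid(const char *lladdr_text, const char *gid_text)
{
	unsigned char hw[IPOIB_HWADDR_LEN];
	struct in6_addr gid;

	if (parse_ipoib_lladdr(lladdr_text, hw) != 0)
		return -1;
	if (inet_pton(AF_INET6, gid_text, &gid) != 1)
		return -1;
	/* Skip the 4 leading flags/QPN bytes; the rest is the GID. */
	return memcmp(hw + 4, gid.s6_addr, sizeof(gid.s6_addr)) == 0;
}
```

Fed the values from the message, the ib3 link-layer address matches the LocalDev shown for both connections, while the ib2 address does not. This is consistent with the observation that the connection bound to 192.168.10.60 is still using the ib3 port.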