[ofa-general] can an ib-bonding slave work independently?

Isaac Huang He.Huang at Sun.COM
Thu Jul 16 11:00:24 PDT 2009


On Wed, Jul 15, 2009 at 10:47:11AM +0300, Or Gerlitz wrote:
> Isaac Huang wrote:
> > [...] bonding device over ib0 and ib2 worked well, but ib2 as an independent IPoIB device couldn't work
> > (ICMP pings failed). It was CentOS 5.3, with ib-bonding-0.9.0-28.
> 
> Generally speaking, assigning an IP address and hence a route entry to a slave is not recommended and

My understanding, which I'd be happy to find false, was that an RDMA
cmid couldn't be created and bound to a bonding device. If that is
true, then assigning IPs to slaves seems to be the only way to get ULPs
that rely on the RDMA CM API to work, while the master interface
provides failover to TCP/IP applications.

> doesn't come without pain, e.g., see "Potential Sources of Trouble",
> section 8.1 "Adventures in Routing" of
> Documentation/networking/bonding.txt, so your problem might have
> nothing to do with IPoIB. What kernel does CentOS 5.3 come with? You
> may be able to use the mainline bonding driver.

Thanks for the pointer; our configuration looked good and the slave
did not have routes that supersede routes of the master:

# ip route show
10.0.0.0/16 dev bond0  proto kernel  scope link  src 10.0.13.49 
10.1.0.0/16 dev ib2  proto kernel  scope link  src 10.1.13.49 
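
For anyone wanting to repeat the "slave routes do not supersede master
routes" check on their own output, here is a minimal Python sketch. It
only tests whether the prefixes attached to the two devices overlap,
which is a simplification of the real route lookup; the helper names
parse_routes and overlapping_routes are made up for illustration.

```python
import ipaddress

def parse_routes(text):
    """Return (network, device) pairs from `ip route show` output.

    Only lines of the form "<prefix> dev <name> ..." are considered;
    anything else (e.g. "default via ...") is skipped.
    """
    routes = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == "dev":
            routes.append((ipaddress.ip_network(fields[0]), fields[2]))
    return routes

def overlapping_routes(routes, master, slave):
    """Return (master_net, slave_net) pairs whose address ranges overlap."""
    master_nets = [net for net, dev in routes if dev == master]
    slave_nets = [net for net, dev in routes if dev == slave]
    return [(m, s) for m in master_nets for s in slave_nets if m.overlaps(s)]

# The routing table quoted in this message.
ROUTES = """\
10.0.0.0/16 dev bond0  proto kernel  scope link  src 10.0.13.49
10.1.0.0/16 dev ib2  proto kernel  scope link  src 10.1.13.49
"""

# An empty list means no prefix on ib2 overlaps a prefix on bond0.
print(overlapping_routes(parse_routes(ROUTES), "bond0", "ib2"))
```

With the table above the list is empty, which matches the statement
that the slave's route does not shadow the master's.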

It appeared that all ARP requests over the slave 'ib2' failed, which
was why ICMP pings failed:

# ip neigh show
10.0.1.111 dev bond0 lladdr 80:00:00:48:fe:80:00:00:00:00:00:10:00:03:ba:00:01:00:fc:05 REACHABLE
10.0.1.101 dev bond0 lladdr 80:00:00:48:fe:80:00:00:00:00:00:10:00:03:ba:00:01:00:fb:05 REACHABLE
10.1.1.112 dev ib2  FAILED
10.1.1.132 dev ib2  FAILED
10.1.1.131 dev ib2  FAILED
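
To make the pattern easier to spot on a larger table, the FAILED
entries can be grouped by device. This is a rough sketch that assumes
the `ip neigh show` layout seen above (address, "dev", device name,
state as the last field); failed_neighbours is a hypothetical helper,
not part of any tool.

```python
from collections import defaultdict

def failed_neighbours(text):
    """Map device name -> list of neighbour IPs whose state is FAILED."""
    failed = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        # Expected shape: <ip> dev <name> [lladdr <addr>] <STATE>
        if len(fields) >= 4 and fields[1] == "dev" and fields[-1] == "FAILED":
            failed[fields[2]].append(fields[0])
    return dict(failed)

# A subset of the neighbour table quoted in this message.
NEIGH = """\
10.0.1.111 dev bond0 lladdr 80:00:00:48:fe:80:00:00:00:00:00:10:00:03:ba:00:01:00:fc:05 REACHABLE
10.1.1.112 dev ib2  FAILED
10.1.1.132 dev ib2  FAILED
10.1.1.131 dev ib2  FAILED
"""

print(failed_neighbours(NEIGH))
```

On the sample above, every failed entry is on ib2 and none is on bond0.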

rdma_resolve_addr() over a cmid bound to the slave also failed, with an
RDMA_CM_EVENT_ADDR_ERROR event and status -ETIMEDOUT.

But tcpdump output on 'ib2' did show the ARP request and response:
15:20:47.571428 arp who-has 10.1.1.132 tell 10.1.13.49 hardware #32
15:20:47.571631 arp reply 10.1.1.132 is-at 80:00:00:49:fe:80:00:00:00:00:00:10:00:03:ba:00:01:00:fb:8a hardware #32

The response seems to have been dropped by the ARP code for some
reason. The ARP code appears to match responses with outstanding
requests on a per-interface basis, and to drop responses that have no
matching request on their incoming interface. When a response arrives
on a slave, is it considered to have been received on the slave
interface or on its master interface? That seems to me to be the only
place where the responses could be dropped, since everything worked
fine when bonding was not enabled.
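
To spell out the hypothesis, here is a toy Python model. It is not the
kernel's ARP or bonding code; ToyArpTable, attributed_iface and the
attribute_to_master switch are invented purely to illustrate how a
reply could go unmatched if it were accounted to the master rather than
to the slave it arrived on.

```python
class ToyArpTable:
    """Toy model: outstanding requests are keyed by (interface, target IP)."""

    def __init__(self):
        self.pending = set()
        self.resolved = {}

    def send_request(self, iface, ip):
        self.pending.add((iface, ip))

    def receive_reply(self, iface, ip, lladdr):
        """Accept the reply only if a request is pending on the same iface."""
        key = (iface, ip)
        if key not in self.pending:
            return False  # no matching request on this interface: dropped
        self.pending.discard(key)
        self.resolved[key] = lladdr
        return True

def attributed_iface(rx_iface, master_of, attribute_to_master):
    """Interface a received frame is accounted to (assumption under test)."""
    if attribute_to_master and rx_iface in master_of:
        return master_of[rx_iface]
    return rx_iface

MASTER_OF = {"ib2": "bond0"}  # ib2 is enslaved to bond0
LLADDR = "80:00:00:49:fe:80:00:00:00:00:00:10:00:03:ba:00:01:00:fb:8a"

def try_resolve(attribute_to_master):
    table = ToyArpTable()
    table.send_request("ib2", "10.1.1.132")  # request goes out on the slave
    iface = attributed_iface("ib2", MASTER_OF, attribute_to_master)
    return table.receive_reply(iface, "10.1.1.132", LLADDR)

print(try_resolve(attribute_to_master=True))   # reply accounted to bond0
print(try_resolve(attribute_to_master=False))  # reply accounted to ib2
```

In this model the reply is dropped in the first case and accepted in
the second, which is the behaviour difference I am asking about.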

Thanks,
Isaac


