[ofa-general] ipoib bonding problems in 1.3-beta2 and 1.2.5.4,
Or Gerlitz
ogerlitz at voltaire.com
Wed Dec 5 23:49:20 PST 2007
Vu Pham wrote:
> My systems are RHEL 5.1 x86-64, 2 Sinai hcas, fw 1.2.0
> I set up bonding as follows:
> IPOIBBOND_ENABLE=yes
> IPOIB_BONDS=bond0
> bond0_IP=11.1.1.1
> bond0_SLAVEs=ib0,ib1
> in /etc/infiniband/openib.conf in order to start ib-bond automatically
Hi Vu,
Please note that in RH5 there is native support for bonding
configuration through the initscripts tools (network scripts, etc.); see
section 3.1.2 of the ib-bonding.txt document provided with the bonding
package.
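For reference, a minimal sketch of what the native configuration looks like
with the standard RH network scripts (the exact options and module settings
are spelled out in section 3.1.2; the device names and IP below are simply
the ones from your example, with a placeholder netmask):

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=11.1.1.1
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-ib0 (and the same for ifcfg-ib1)
DEVICE=ib0
ONBOOT=yes
BOOTPROTO=none
MASTER=bond0
SLAVE=yes

# /etc/modprobe.conf -- load the bonding driver in active-backup mode
alias bond0 bonding
options bonding mode=active-backup miimon=100

With that in place, "service network restart" (or ifup bond0) brings the
bond up with no help from openib.conf.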
The persistence mechanism which you have used (e.g. through
/etc/init.d/openibd and /etc/openib.conf) is there only for somewhat OLD
distributions for which there is no native (*) support for bonding
configuration. Actually, I was thinking we wanted to remove it
altogether, Moni?
(*) Under RH4 the native support is broken for IPoIB bonding, and hence
we patched some of the initscripts.
> I moved our systems back to ofed-1.2.5.4 and tested ib-bond again. We
> tested it with ib0 and ib1 (connected to different switches/fabrics) being
> on the same subnet (10.2.1.x, 255.255.255.0) and on different subnets
> (10.2.1.x and 10.3.1.x, 255.255.255.0). In both cases there is the issue
> of losing communication between the servers if the nodes are not on
> the same primary ib interface.
Generally speaking, I don't see the point in using bonding for
--high-availability-- where each slave is connected to a different fabric.
This is b/c when one system fails over, you also need the second system
to fail over; you would also not be able to count on local link detection
mechanisms, since the remote node must now fail over even though its own
local link is perfectly fine. This is true regardless of the interconnect
type.
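To see why local link detection doesn't help here: the bonding driver's
miimon only samples the local carrier state, so on the node whose ib0 link
is still up nothing ever triggers a switch. A quick way to check, using the
standard bonding procfs/sysfs files:

# on the node that did NOT lose its link -- carrier is still up,
# so the driver keeps ib0 as the active slave
cat /sys/class/net/ib0/carrier
grep -i "mii status" /proc/net/bonding/bond0
cat /sys/class/net/bond0/bonding/active_slave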
Am I missing something here regarding your setup?
The question of the use case for bonding over separate fabrics has been
brought to me several times and I gave this answer; no-one ever tried to
educate me on why it's interesting, maybe you will do so...
Also, what do you mean by "ib0 and ib1 being on the same/different
subnets"? It's only the master device (e.g. bond0, bond1, etc.) that has an
association/configuration with an IP subnet, correct?
> 1. original state: ib0's are the primary on both servers - pinging bond0
> between the servers is fine
> 2. fail ib0 on one of the servers (ib1 becomes primary on this server) -
> pinging bond0 between the servers fails
sure, b/c there's no reason for the remote bonding to issue a fail-over.
> 3. fail ib0 on the second server (ib1 becomes primary) - pinging bond0
> between the servers is fine again
indeed.
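To tie this back to something you can check at each step: the per-node
bonding state is visible through the standard bonding files, and (assuming
active-backup mode) the active slave can also be switched by hand to bring
both nodes back onto the same fabric without failing another link:

# which slave is currently carrying the traffic on this node
cat /sys/class/net/bond0/bonding/active_slave
# full per-slave state (MII status, link failure count, ...)
cat /proc/net/bonding/bond0
# move this node over to ib1 as well (the slave must be up)
echo ib1 > /sys/class/net/bond0/bonding/active_slave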
Or.