[ofa-general] ipoib bonding problems in 1.3-beta2 and 1.2.5.4,

Vu Pham vuhuong at mellanox.com
Wed Dec 5 10:00:35 PST 2007


Hi Moni,

My systems are RHEL 5.1 x86-64 with 2 Sinai HCAs, fw 1.2.0.

I set up bonding as follows:
IPOIBBOND_ENABLE=yes
IPOIB_BONDS=bond0
bond0_IP=11.1.1.1
bond0_SLAVES=ib0,ib1
in /etc/infiniband/openib.conf in order to start ib-bond automatically

Testing with OFED-1.3-beta2, I got the following crash while the system was 
booting up:

Stack: ffffffff883429d0 fff810428519d30 ................
Call Trace:
[<ffffffff883429d0>] :bonding:bond_get_stats+0x4a/0x131
[<        8020e9cd>] rtnetlink_fill_ifinfo+0x4ba/0x5c4
              ee19>] rtmsg_ifinfo+0x44/0x8d
              eea2>] rtnetlink_event+0x40/0x44
          8006492a>] notifier_call_chain+0x20/0x32
          80208b5e>] dev_open+0x68/0x6e
              72e8>] dev_change_flags+0x5a/0x119
          80239762>] devinet_ioctl+0x235/0x59c
          801ffcf6>] sock_ioctl+0x1c1/0x1e5
          8003fc3f>] do_ioctl+0x21/0x6b
          8002fa45>] vfs_ioctl+0x248/0x261
          8004a24b>] sys_ioctl+0x59/0x78
          8005b14e>] system_call+0x7e/0x83

Code: Bad RIP value
RIP [0000000000000000000000] _stext+0x7ffff000/0x1000
 RSP <ffff10428519cc0>
CR2: 000000000000000000000
 <0>Kernel panic - not syncing: Fatal exception

I opened bug #812 for this issue.

I moved our systems back to OFED-1.2.5.4 and tested ib-bond again. We 
tested it with ib0 and ib1 (each connected to a different switch/fabric) 
both on the same subnet (10.2.1.x, 255.255.255.0) and on different subnets 
(10.2.1.x and 10.3.1.x, 255.255.255.0). In both cases the servers lose 
communication with each other whenever the nodes are not using the same 
primary IB interface.

Example:
1. Original state: ib0 is the primary on both servers - pinging bond0 
between the servers is fine
2. Fail ib0 on one of the servers (ib1 becomes primary on this server) - 
pinging bond0 between the servers fails
3. Fail ib0 on the second server (ib1 becomes primary) - pinging bond0 
between the servers is fine again
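
For anyone trying to reproduce the steps above, here is a rough Python 
sketch (not part of ib-bond) for checking which slave each server is 
currently using. It assumes the bonding driver's status report under 
/proc/net/bonding/bond0 contains a "Currently Active Slave:" line, as it 
normally does in active-backup mode. The sample_report() helper and its 
text are made up for illustration and are not captured from our systems.

```python
import re
from typing import Optional


def sample_report(active: str) -> str:
    """Hypothetical helper: build a minimal status report for illustration."""
    return (
        "Bonding Mode: fault-tolerance (active-backup)\n"
        f"Currently Active Slave: {active}\n"
        "MII Status: up\n"
    )


def active_slave(report: str) -> Optional[str]:
    """Return the interface named on the 'Currently Active Slave' line."""
    match = re.search(r"^Currently Active Slave:\s*(\S+)", report, re.MULTILINE)
    # The driver prints "None" when no slave is active.
    if match is None or match.group(1) == "None":
        return None
    return match.group(1)


def same_active_slave(report_a: str, report_b: str) -> bool:
    """True only if both reports name the same, non-empty active slave."""
    slave_a = active_slave(report_a)
    slave_b = active_slave(report_b)
    return slave_a is not None and slave_a == slave_b


if __name__ == "__main__":
    # Step 2 of the example: one server has failed over to ib1,
    # the other is still on ib0.
    server_a = sample_report("ib1")
    server_b = sample_report("ib0")
    print(active_slave(server_a), active_slave(server_b))
    print(same_active_slave(server_a, server_b))
```

On a live node the report text would come from reading 
/proc/net/bonding/bond0 on each server instead of sample_report(). In the 
example above, the ping fails exactly in the state where 
same_active_slave() would return False.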

Is this behavior expected?

thanks,
-vu


