[ewg] Crash in bonding
Pradeep Satyanarayana
pradeeps at linux.vnet.ibm.com
Mon Nov 2 14:41:36 PST 2009
This crash was originally reported against RHEL 5.4. However, it can be recreated quite easily in OFED-1.5 too.
The steps to recreate the crash are as follows:
1. Run traffic (I used ping) on the IB interfaces through the bond master
2. ifdown ib0
3. ifdown ib1
4. modprobe -r ib_ipoib
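The four steps above can be sketched as a small script. This is only a sketch under stated assumptions: the peer address in PEER, the DRY_RUN switch, the use of `ping -I bond0` and the one-second pause are illustrative choices and not part of the original report. By default it only prints the commands; running it for real needs root and a disposable test machine, since the expected result is a kernel crash.

```shell
#!/bin/sh
# Reproduction sketch. Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 (as root, on a test machine) to actually run them.
DRY_RUN="${DRY_RUN:-1}"
PEER="${PEER:-192.0.2.1}"   # hypothetical address reachable through bond0

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reproduce() {
    # 1. Keep traffic flowing through the bond master.
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ ping -I bond0 $PEER &"
    else
        ping -I bond0 "$PEER" >/dev/null 2>&1 &
        sleep 1   # assumption: give ping a moment to start sending
    fi
    # 2. and 3. Take both slaves down.
    run ifdown ib0
    run ifdown ib1
    # 4. Unload IPoIB while ping is still transmitting.
    run modprobe -r ib_ipoib
}

reproduce
```

With the defaults this prints the ping line followed by the two ifdown commands and the modprobe command, in that order.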
Quite often, the crash stack trace seen is as follows:
PID: 0 TASK: ffff81087fc11820 CPU: 13 COMMAND: "swapper"
#0 [ffff81010ff07ab0] crash_kexec at ffffffff800ac5b9
#1 [ffff81010ff07b70] __die at ffffffff80065127
#2 [ffff81010ff07bb0] do_page_fault at ffffffff80066da7
#3 [ffff81010ff07ca0] error_exit at ffffffff8005dde9
#4 [ffff81010ff07d58] neigh_connected_output at ffffffff8022cb87
#5 [ffff81010ff07d88] ip_output at ffffffff800320ac
#6 [ffff81010ff07db8] ip_queue_xmit at ffffffff8003464d
#7 [ffff81010ff07e78] tcp_transmit_skb at ffffffff80021d73
#8 [ffff81010ff07ec8] tcp_retransmit_skb at ffffffff80250ccd
#9 [ffff81010ff07f08] tcp_write_timer at ffffffff80252652
#10 [ffff81010ff07f28] run_timer_softirq at ffffffff800968be
#11 [ffff81010ff07f58] __do_softirq at ffffffff8001235a
#12 [ffff81010ff07f88] call_softirq at ffffffff8005e2fc
#13 [ffff81010ff07fa0] do_softirq at ffffffff8006cb14
#14 [ffff81010ff07fb0] apic_timer_interrupt at ffffffff8005dc8e
--- <IRQ stack> ---
#15 [ffff81010ff03e48] apic_timer_interrupt at ffffffff8005dc8e
[exception RIP: mwait_idle+54]
RIP: ffffffff800571f4 RSP: ffff81010ff03ef0 RFLAGS: 00000246
RAX: 0000000000000000 RBX: 000000000000000d RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff80301698
RBP: ffff81087fc11a10 R8: ffff81010ff02000 R9: 0000000000000032
R10: ffff81048e0cc4f0 R11: ffff8103ebafcd18 R12: 0000000005f33f4d
R13: 00000d12e63d7223 R14: ffff81047fe797a0 R15: ffff81087fc11820
ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
#16 [ffff81010ff03ef0] cpu_idle at ffffffff8004939e
I was able to set up some breakpoints, and the analysis follows.
cpu 0x1 stopped at breakpoint 0x1 (d000000000ec4214 .bond_release+0x0/0x4d0 [bonding])
mflr r0
enter ? for help
1:mon> t
[link register ] d000000000ecdf80 .bonding_store_slaves+0x304/0x3f0 [bonding]
[c00000000fd97b00] d000000000ecdf70 .bonding_store_slaves+0x2f4/0x3f0 [bonding] (unreliable)
[c00000000fd97bd0] c00000000029a660 .class_device_attr_store+0x44/0x60
[c00000000fd97c40] c00000000015df9c .sysfs_write_file+0x134/0x1b8
[c00000000fd97cf0] c0000000000f8ec4 .vfs_write+0x118/0x200
[c00000000fd97d90] c0000000000f9634 .sys_write+0x4c/0x8c
[c00000000fd97e30] c0000000000086a4 syscall_exit+0x0/0x40
--- Exception: c00 (System Call) at 000000000ff11138
SP (ffd1f300) is in userspace
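The trace above reaches bond_release() from bonding_store_slaves(), that is, from a write to the bonding sysfs slaves attribute. A minimal sketch of such a write follows, assuming the standard `/sys/class/net/<bond>/bonding/slaves` attribute and its `-<slave>` syntax. The helper name release_slave and the DRY_RUN switch are hypothetical, and whether ifdown issues exactly this write is an assumption.

```shell
#!/bin/sh
# Dry run by default: print the sysfs write instead of performing it.
DRY_RUN="${DRY_RUN:-1}"

# release_slave BOND SLAVE: detach SLAVE from BOND through the bonding
# sysfs interface (hypothetical helper, for illustration only).
release_slave() {
    attr="/sys/class/net/$1/bonding/slaves"
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ echo -$2 > $attr"
    else
        echo "-$2" > "$attr"
    fi
}

release_slave bond0 ib1
```

Reading the same attribute with cat afterwards lists the slaves that are still attached to the bond.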
I did some basic sanity checks and confirmed that we hit a couple of breakpoints, that
the bond master was indeed bond0 as expected, and that the slave device being released was ib1.
After the breakpoints, we crashed:
Faulting instruction address: 0xc00000000034bddc
cpu 0x1: Vector: 300 (Data Access) at [c0000000e025b2b0]
pc: c00000000034bddc: .neigh_resolve_output+0x28c/0x34c
lr: c00000000034bdc0: .neigh_resolve_output+0x270/0x34c
sp: c0000000e025b530
msr: 8000000000009032
dar: d000000000c6fe58
dsisr: 40000000
current = 0xc0000000e25f1aa0
paca = 0xc00000000053e280
pid = 3591, comm = ping
enter ? for help
1:mon> e
cpu 0x1: Vector: 300 (Data Access) at [c0000000e025b2b0]
pc: c00000000034bddc: .neigh_resolve_output+0x28c/0x34c
lr: c00000000034bdc0: .neigh_resolve_output+0x270/0x34c
sp: c0000000e025b530
msr: 8000000000009032
dar: d000000000c6fe58
dsisr: 40000000
current = 0xc0000000e25f1aa0
paca = 0xc00000000053e280
pid = 3591, comm = ping
1:mon> t
[c0000000e025b5e0] c000000000376934 .ip_output+0x358/0x3c0
[c0000000e025b670] c000000000374a04 .ip_push_pending_frames+0x440/0x558
[c0000000e025b720] c000000000397f10 .raw_sendmsg+0x770/0x860
[c0000000e025b860] c0000000003a24f8 .inet_sendmsg+0x7c/0xa8
[c0000000e025b900] c00000000033031c .sock_sendmsg+0x114/0x1b8
[c0000000e025bb00] c000000000331878 .sys_sendmsg+0x218/0x2ac
[c0000000e025bd20] c000000000356314 .compat_sys_sendmsg+0x14/0x28
[c0000000e025bd90] c000000000357914 .compat_sys_socketcall+0x1e4/0x214
[c0000000e025be30] c0000000000086a4 syscall_exit+0x0/0x40
--- Exception: c00 (System Call) at 0000000007f03c98
SP (ffb6e570) is in userspace
1:mon>
I looked at the skb and confirmed that it was indeed for bond0.
One thing is apparent at this point: ping keeps transmitting even though bond_release()
for ib1 (and of course ib0) occurred well before!
This is the reason for the crash. Any suggestions as to how to fix this?
Pradeep