[ewg] Crash in bonding

Shiri Franchi shirif at voltaire.com
Tue Nov 3 04:57:34 PST 2009


Hi,

I tried to reproduce on RHEL 5.4 with ping and iperf, and it did not
happen.
Are you sure you used "modprobe -r ib_ipoib", or perhaps "modprobe -r
bonding"?

Thanks,
Shiri

On Mon, 2009-11-02 at 14:41 -0800, Pradeep Satyanarayana wrote:
> This crash was originally reported against RHEL 5.4. However, one can recreate it quite easily on OFED-1.5 too.
> The steps to recreate the crash are as follows:
> 
> 1. Run traffic (I used ping) on the IB interfaces through the bond master
> 2. ifdown ib0
> 3. ifdown ib1
> 4. modprobe -r ib_ipoib
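
The four steps above can be collected into a quick repro script. This is only a sketch of my setup: it assumes bond0 already enslaves ib0 and ib1, and the peer address is a placeholder you would replace with a host reachable through the bond.

```shell
#!/bin/sh
# Repro sketch for the bonding/IPoIB crash.
# Assumes: bond0 is up and enslaves ib0 and ib1.
PEER=192.168.1.2   # hypothetical peer reachable through bond0

# 1. Run traffic on the IB interfaces through the bond master.
ping "$PEER" &
PING_PID=$!
sleep 5

# 2./3. Take the slaves down while traffic is still in flight.
ifdown ib0
ifdown ib1

# 4. Unload the IPoIB driver -- this is where the box panics.
modprobe -r ib_ipoib

kill "$PING_PID" 2>/dev/null
```

Note that traffic must still be in flight when the module is removed; in the trace below the fault fires from the TCP retransmit timer while the ping/iperf load is running.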
> 
> Quite often, the crash stack trace seen is as follows:
> 
> PID: 0      TASK: ffff81087fc11820  CPU: 13  COMMAND: "swapper"
>  #0 [ffff81010ff07ab0] crash_kexec at ffffffff800ac5b9
>  #1 [ffff81010ff07b70] __die at ffffffff80065127
>  #2 [ffff81010ff07bb0] do_page_fault at ffffffff80066da7
>  #3 [ffff81010ff07ca0] error_exit at ffffffff8005dde9
>  #4 [ffff81010ff07d58] neigh_connected_output at ffffffff8022cb87
>  #5 [ffff81010ff07d88] ip_output at ffffffff800320ac
>  #6 [ffff81010ff07db8] ip_queue_xmit at ffffffff8003464d
>  #7 [ffff81010ff07e78] tcp_transmit_skb at ffffffff80021d73
>  #8 [ffff81010ff07ec8] tcp_retransmit_skb at ffffffff80250ccd
>  #9 [ffff81010ff07f08] tcp_write_timer at ffffffff80252652
> #10 [ffff81010ff07f28] run_timer_softirq at ffffffff800968be
> #11 [ffff81010ff07f58] __do_softirq at ffffffff8001235a
> #12 [ffff81010ff07f88] call_softirq at ffffffff8005e2fc
> #13 [ffff81010ff07fa0] do_softirq at ffffffff8006cb14
> #14 [ffff81010ff07fb0] apic_timer_interrupt at ffffffff8005dc8e
> --- <IRQ stack> ---
> #15 [ffff81010ff03e48] apic_timer_interrupt at ffffffff8005dc8e
>     [exception RIP: mwait_idle+54]
>     RIP: ffffffff800571f4  RSP: ffff81010ff03ef0  RFLAGS: 00000246
>     RAX: 0000000000000000  RBX: 000000000000000d  RCX: 0000000000000000
>     RDX: 0000000000000000  RSI: 0000000000000001  RDI: ffffffff80301698
>     RBP: ffff81087fc11a10   R8: ffff81010ff02000   R9: 0000000000000032
>     R10: ffff81048e0cc4f0  R11: ffff8103ebafcd18  R12: 0000000005f33f4d
>     R13: 00000d12e63d7223  R14: ffff81047fe797a0  R15: ffff81087fc11820
>     ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
> #16 [ffff81010ff03ef0] cpu_idle at ffffffff8004939e
> 
> 
> 
> I was able to set up some breakpoints, and the analysis follows.
> 
> cpu 0x1 stopped at breakpoint 0x1 (d000000000ec4214 .bond_release+0x0/0x4d0 [bonding])
>         mflr    r0
> enter ? for help
> 1:mon> t
> [link register   ] d000000000ecdf80 .bonding_store_slaves+0x304/0x3f0 [bonding]
> [c00000000fd97b00] d000000000ecdf70 .bonding_store_slaves+0x2f4/0x3f0 [bonding] (unreliable)
> [c00000000fd97bd0] c00000000029a660 .class_device_attr_store+0x44/0x60
> [c00000000fd97c40] c00000000015df9c .sysfs_write_file+0x134/0x1b8
> [c00000000fd97cf0] c0000000000f8ec4 .vfs_write+0x118/0x200
> [c00000000fd97d90] c0000000000f9634 .sys_write+0x4c/0x8c
> [c00000000fd97e30] c0000000000086a4 syscall_exit+0x0/0x40
> --- Exception: c00 (System Call) at 000000000ff11138
> SP (ffd1f300) is in userspace
> 
> I did some basic sanity checks and confirmed that we hit a couple of breakpoints,
> that the bond master was indeed bond0 as expected, and that the slave device being
> released was ib1. After the breakpoints, we crashed:
> 
> 
> Faulting instruction address: 0xc00000000034bddc
> cpu 0x1: Vector: 300 (Data Access) at [c0000000e025b2b0]
>     pc: c00000000034bddc: .neigh_resolve_output+0x28c/0x34c
>     lr: c00000000034bdc0: .neigh_resolve_output+0x270/0x34c
>     sp: c0000000e025b530
>    msr: 8000000000009032
>    dar: d000000000c6fe58
>  dsisr: 40000000
>   current = 0xc0000000e25f1aa0
>   paca    = 0xc00000000053e280
>     pid   = 3591, comm = ping
> enter ? for help
> 1:mon> e
> cpu 0x1: Vector: 300 (Data Access) at [c0000000e025b2b0]
>     pc: c00000000034bddc: .neigh_resolve_output+0x28c/0x34c
>     lr: c00000000034bdc0: .neigh_resolve_output+0x270/0x34c
>     sp: c0000000e025b530
>    msr: 8000000000009032
>    dar: d000000000c6fe58
>  dsisr: 40000000
>   current = 0xc0000000e25f1aa0
>   paca    = 0xc00000000053e280
>     pid   = 3591, comm = ping
> 1:mon> t
> [c0000000e025b5e0] c000000000376934 .ip_output+0x358/0x3c0
> [c0000000e025b670] c000000000374a04 .ip_push_pending_frames+0x440/0x558
> [c0000000e025b720] c000000000397f10 .raw_sendmsg+0x770/0x860
> [c0000000e025b860] c0000000003a24f8 .inet_sendmsg+0x7c/0xa8
> [c0000000e025b900] c00000000033031c .sock_sendmsg+0x114/0x1b8
> [c0000000e025bb00] c000000000331878 .sys_sendmsg+0x218/0x2ac
> [c0000000e025bd20] c000000000356314 .compat_sys_sendmsg+0x14/0x28
> [c0000000e025bd90] c000000000357914 .compat_sys_socketcall+0x1e4/0x214
> [c0000000e025be30] c0000000000086a4 syscall_exit+0x0/0x40
> --- Exception: c00 (System Call) at 0000000007f03c98
> SP (ffb6e570) is in userspace
> 1:mon>
> 
> I looked at the skb and confirmed that this was indeed against bond0.
> 
> One thing is apparent at this point: ping is continuing even though bond_release()
> for ib1 (and of course ib0) completed long ago!
> 
> This is the reason for the crash. Any suggestions as to how to fix this?
> 
> Pradeep
> 
> 
> 
> _______________________________________________
> ewg mailing list
> ewg at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
