[ofa-general] ipoib crashes with 2.6.27-rc7

Or Gerlitz ogerlitz at voltaire.com
Mon Sep 22 01:59:06 PDT 2008


While attempting to set up an ipoib / partitioning bonding environment with
2.6.27-rc7, I came across a few ipoib crashes, e.g. the two oops
listings below. I understand that Yossi sent some patches just
recently, so they may help; or do they fall into the
non-regression-from-2.6.26 category?

Or.

This is seen on node startup:
> mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0 (April 4, 2008)
> NET: Registered protocol family 10
> lo: Disabled Privacy Extensions
> ADDRCONF(NETDEV_UP): ib0.8003: link is not ready
> ------------[ cut here ]------------
> kernel BUG at include/linux/netdevice.h:415!
> invalid opcode: 0000 [1] SMP CPU 7
> Modules linked in: rdma_ucm ib_ucm rdma_cm iw_cm ib_addr ib_ipoib ib_cm
> ib_sa inet_lro ipv6 ib_uverbs ib_umad mlx4_ib ib_mthca ib_mad ib_core
> dm_multipath battery ac floppy sr_mod joydev sg igb mlx4_core shpchp
> button pcspkr rng_core dm_snapshot dm_zero dm_mirror dm_log dm_mod
> usb_storage ata_piix libata sd_mod scsi_mod dock ext3 jbd ehci_hcd
> ohci_hcd uhci_hcd [last unloaded: microcode]
> Pid: 3035, comm: ipoib Not tainted 2.6.27-rc7 #2
> RIP: 0010:[<ffffffffa01f364c>]  [<ffffffffa01f364c>] ipoib_open+0x3c/0x150 [ib_ipoib]
> RSP: 0018:ffff880229d15e90  EFLAGS: 00010246
> RAX: ffff88021f00a878 RBX: ffff88021f00a7a0 RCX: 0000000000000000
> RDX: 0003000600000000 RSI: ffff88022e029880 RDI: ffff88021f00a000
> RBP: ffff88021f00a780 R08: 0000000000000000 R09: ffffffff805a8e40
> R10: 0000000000000000 R11: 0000000000000003 R12: ffff88021f00a000
> R13: ffffffffa01f4af2 R14: ffffffff805e32c0 R15: 0000000000000000
> FS:  0000000000000000(0000) GS:ffff88022f826580(0000) knlGS:0000000000000000
> CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 00000000008cb170 CR3: 000000022e5d2000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process ipoib (pid: 3035, threadinfo ffff880229d14000, task ffff88022e195f00)
> Stack:  ffff88021f00a878 ffff88022d02c780 ffff88021f00a870 ffffffff8023fd92
>  ffff88022c531d18 ffff88022d02c780 ffff88022d02c7a8 ffff88022c531d18
>  ffffffff805e0e80 ffffffff80240700 0000000000000000 ffff88022e195f00
> Call Trace:
>  [<ffffffff8023fd92>] ? run_workqueue+0x88/0x118
>  [<ffffffff80240700>] ? worker_thread+0xd5/0xe0
>  [<ffffffff80242f41>] ? autoremove_wake_function+0x0/0x2e
>  [<ffffffff8024062b>] ? worker_thread+0x0/0xe0
>  [<ffffffff80242e38>] ? kthread+0x47/0x73
>  [<ffffffff8022d2e4>] ? schedule_tail+0x28/0x60
>  [<ffffffff8020c179>] ? child_rip+0xa/0x11
>  [<ffffffff80242df1>] ? kthread+0x0/0x73
>  [<ffffffff8020c16f>] ? child_rip+0x0/0x11
>
>
> Code: 07 00 00 53 7e 12 48 8b 75 18 48 c7 c7 ff c5 1f a0 31 c0 e8 e7 eb 03
> e0 41 f6 84 24 b0 07 00 00 01 49 8d 9c 24 a0 07 00 00 75 04 <0f> 0b eb fe
> f0 80 63 10 fe f0 80 8d 80 00 00 00 04 4c 89 e7 e8
> RIP  [<ffffffffa01f364c>] ipoib_open+0x3c/0x150 [ib_ipoib]
>  RSP <ffff880229d15e90>
> ---[ end trace d51c7bec8b19b076 ]---

And this takes place when you attempt to take ib0 down in the presence
of child devices which are not running. If there are no child devices,
it doesn't happen:

> ib0.8003: Failed to modify QP to ERROR state
> BUG: soft lockup - CPU#0 stuck for 61s! [ifconfig:7481]
> CPU 0:
> Modules linked in: autofs4 sunrpc ib_iser iscsi_tcp libiscsi scsi_transport_iscsi bonding rdma_ucm ib_ucm rdma_cm iw_cm ib_addr ib_ipoib ib_cm ib_sa inet_lro ipv6 ib_uverbs ib_umad mlx4_ib ib_mthca ib_mad ib_core dm_multipath battery ac floppy sr_mod igb joydev mlx4_core shpchp sg button pcspkr rng_core dm_snapshot dm_zero dm_mirror dm_log dm_mod usb_storage ata_piix libata sd_mod scsi_mod dock ext3 jbd ehci_hcd ohci_hcd uhci_hcd [last unloaded: microcode]
> Pid: 7481, comm: ifconfig Tainted: G      D   2.6.27-rc7 #2
> RIP: 0010:[<ffffffff80239a3e>]  [<ffffffff80239a3e>] lock_timer_base+0x15/0x4b
> RSP: 0018:ffff880213d75c28  EFLAGS: 00000246
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000100
> RDX: 0000000000001800 RSI: ffff880213d75c68 RDI: ffff880222cb94d0
> RBP: ffff880222cb8000 R08: 0000000000000100 R09: ffff8800280bb900
> R10: 0000000000000000 R11: ffffffff8031c680 R12: ffff880222cb8780
> R13: ffff880222cb8780 R14: ffff880222cb87a0 R15: ffff88002805cf00
> FS:  00007f7f380fc710(0000) GS:ffffffff805a9a80(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00007f52433af000 CR3: 000000021c5cf000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>
> Call Trace:
>  [<ffffffff80239a8a>] ? try_to_del_timer_sync+0x16/0x5a
>  [<ffffffff80239ada>] ? del_timer_sync+0xc/0x16
>  [<ffffffffa01f44ed>] ? ipoib_ib_dev_stop+0x190/0x26d [ib_ipoib]
>  [<ffffffff80459c81>] ? _spin_lock_irqsave+0x9/0xe
>  [<ffffffff80239a4f>] ? lock_timer_base+0x26/0x4b
>  [<ffffffff8022ad25>] ? default_wake_function+0x0/0xe
>  [<ffffffff80459c69>] ? _spin_unlock_irq+0x9/0xc
>  [<ffffffffa01f23ca>] ? ipoib_flush_paths+0x13a/0x145 [ib_ipoib]
>  [<ffffffffa01f2ab0>] ? ipoib_stop+0x7e/0xf8 [ib_ipoib]
>  [<ffffffff803e5553>] ? dev_close+0x6f/0x87
>  [<ffffffff803e5261>] ? dev_change_flags+0xa6/0x15c
>  [<ffffffffa01f2aea>] ? ipoib_stop+0xb8/0xf8 [ib_ipoib]
>  [<ffffffff803e5553>] ? dev_close+0x6f/0x87
>  [<ffffffff803e5261>] ? dev_change_flags+0xa6/0x15c
>  [<ffffffff80424b68>] ? devinet_ioctl+0x242/0x58a
>  [<ffffffff803db45d>] ? sock_ioctl+0x1d2/0x1f9
>  [<ffffffff80291e31>] ? vfs_ioctl+0x21/0x6b
>  [<ffffffff802920d4>] ? do_vfs_ioctl+0x259/0x272
>  [<ffffffff8029213e>] ? sys_ioctl+0x51/0x73
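
For anyone who wants to try the second case, the scenario above (ib0 taken down while a child device exists but is not running) might be reproduced roughly as sketched below. This is only a sketch under assumptions: it takes "not running" to mean the child was created but never brought up, it infers the partition key 0x8003 from the device name ib0.8003, and it leaves bonding out entirely. The `run` and `repro` helpers and the `RUN`, `PARENT` and `PKEY` variables are made up for illustration. By default the script only prints the commands; set RUN=1 (as root, on a disposable test node) to actually execute them.

```shell
#!/bin/sh
# Hypothetical reproduction sketch; names and defaults are assumptions.
PARENT=${PARENT:-ib0}      # parent IPoIB interface, as in the report
PKEY=${PKEY:-0x8003}       # partition key inferred from the name ib0.8003

# Print each command instead of running it unless RUN=1 is set.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

repro() {
    # Create the child interface through the IPoIB sysfs attribute.
    run sh -c "echo $PKEY > /sys/class/net/$PARENT/create_child"
    # Bring only the parent up; the child is deliberately left down.
    run ifconfig "$PARENT" up
    # Taking the parent down is the step that hung in the report.
    run ifconfig "$PARENT" down
}

repro
```

Whether these three steps alone are enough to trigger the soft lockup without the bonding setup is not something the logs above confirm.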




