[openib-general] oops with multicast patches
Michael S. Tsirkin
mst at mellanox.co.il
Mon Dec 4 06:22:14 PST 2006
OK, I finally got back to this. First, I reproduced the crash again,
this time with the spinlock debugger enabled. It looks like we are dealing with a use-after-free.
Next, I'll try adding the debugging patch Sean posted and see what it shows.
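For readers unfamiliar with the "bad magic" report, here is a minimal userspace sketch of the idea, not the kernel code itself. The spinlock debugger stores a known constant in every initialized lock and checks it before taking the lock, so a lock that sits in released and overwritten memory fails the check. All `demo_*` names are hypothetical. The 0x6b fill byte is borrowed from slab poisoning and only stands in for whatever overwrote the real object. The sketch overwrites the object in place instead of freeing it, so it has no undefined behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same constant the kernel's spinlock debugging uses for SPINLOCK_MAGIC. */
#define DEMO_LOCK_MAGIC 0xdead4eadu
/* Fill byte used here to simulate released memory (assumption: any
 * overwrite would do; 0x6b is the slab debugger's POISON_FREE value). */
#define DEMO_POISON_FREE 0x6b

/* Hypothetical stand-in for a debug spinlock; only the magic matters here. */
struct demo_lock {
    unsigned int locked;
    unsigned int magic;
};

/* Hypothetical stand-in for an object that embeds a lock. */
struct demo_group {
    struct demo_lock lock;
    int refcount;
};

/* Initialize the lock and stamp it with the magic value. */
void demo_lock_init(struct demo_lock *lock)
{
    assert(lock != NULL);
    lock->locked = 0;
    lock->magic = DEMO_LOCK_MAGIC;
}

/* Returns 1 while the magic is intact, 0 in the "bad magic" case. */
int demo_lock_magic_ok(const struct demo_lock *lock)
{
    assert(lock != NULL);
    return lock->magic == DEMO_LOCK_MAGIC;
}

/* Simulate release and reuse by overwriting the whole object in place. */
void demo_release(struct demo_group *group)
{
    assert(group != NULL);
    memset(group, DEMO_POISON_FREE, sizeof(*group));
}
```

Calling `demo_lock_magic_ok()` on a freshly initialized lock returns 1; after `demo_release()` on the enclosing object it returns 0, which is the situation the debugger reports as bad magic.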
> When running the test ib_mcast_full, both hosts (10.4.10.136-137) got a kernel oops (see below).
> The test first restarts the driver and then attaches to the maximum number of available multicast groups.
BUG: spinlock bad magic on CPU#1, ib_mad2/15709
Unable to handle kernel paging request at 00000001003e0107 RIP:
<ffffffff802e0280>{spin_bug+116}
PGD 75f1a067 PUD 0
Oops: 0000 [1] SMP
CPU 1
Modules linked in: mst_pciconf mst_pci ib_mthca ib_umad ib_sa ib_mad ib_core nfsd exportfs ipv6 parport_pc lp parport autofs4 nfs lockd nfs_acl sunrpc dm_mirror dm_mod button battery ac ohci_hcd i2c_amd8111 i2c_amd756 i2c_core tg3 floppy ext3 jbd
Pid: 15709, comm: ib_mad2 Not tainted 2.6.17.9-smp #1
RIP: 0010:[<ffffffff802e0280>] <ffffffff802e0280>{spin_bug+116}
RSP: 0018:ffff810043b43ca8 EFLAGS: 00010006
RAX: 0000000000000000 RBX: 00000001003e0003 RCX: ffffffff80438e07
RDX: ffffffff80480e18 RSI: 0000000000000046 RDI: ffffffff80480e00
RBP: ffff81007a1cd840 R08: 00000000ffffffff R09: 0000000000000004
R10: 0000000100000000 R11: 0000000000000046 R12: ffff81007a1cd838
R13: 0000000000000293 R14: 0000000000000000 R15: ffffffff880732ce
FS: 00002b76b6c7e4e0(0000) GS:ffff81007df2eac0(0000) knlGS:00000000f7fd78e0
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000001003e0107 CR3: 0000000075d1e000 CR4: 00000000000006e0
Process ib_mad2 (pid: 15709, threadinfo ffff810043b42000, task ffff81007d047880)
Stack: 0000000000000003 ffff81007a1cd840 ffff81007a1cd840 ffffffff802e02cd
ffff81007cfe9600 ffff81007a1cd840 ffff81007a1cd838 ffffffff8040ee4b
0000000000000246 ffffffff8807beff
Call Trace: <ffffffff802e02cd>{_raw_spin_lock+28} <ffffffff8040ee4b>{_spin_lock_irqsave+11}
<ffffffff8807beff>{:ib_sa:release_group+27} <ffffffff8807c903>{:ib_sa:mcast_work_handler+1280}
<ffffffff802232bc>{find_busiest_group+304} <ffffffff880732ce>{:ib_mad:timeout_sends+0}
<ffffffff8807b8c3>{:ib_sa:ib_sa_mcmember_rec_callback+64}
<ffffffff8040eef7>{_spin_unlock_irq+7} <ffffffff8040d976>{thread_return+100}
<ffffffff8807bac4>{:ib_sa:send_handler+74} <ffffffff8807345b>{:ib_mad:timeout_sends+397}
<ffffffff80238e94>{run_workqueue+161} <ffffffff80238ede>{worker_thread+0}
<ffffffff8023be88>{keventd_create_kthread+0} <ffffffff80238fe3>{worker_thread+261}
<ffffffff80223e8f>{default_wake_function+0} <ffffffff8023be88>{keventd_create_kthread+0}
<ffffffff80223e8f>{default_wake_function+0} <ffffffff8023be88>{keventd_create_kthread+0}
<ffffffff8023be5f>{kthread+200} <ffffffff8020a6aa>{child_rip+8}
<ffffffff8023be88>{keventd_create_kthread+0} <ffffffff8023bd97>{kthread+0}
<ffffffff8020a6a2>{child_rip+0}
Code: 44 8b 83 04 01 00 00 48 8d 8b a0 02 00 00 8b 55 04 41 89 c1
RIP <ffffffff802e0280>{spin_bug+116} RSP <ffff810043b43ca8>
CR2: 00000001003e0107
<3>BUG: sleeping function called from invalid context at include/linux/rwsem.h:43
in_atomic():0, irqs_disabled():1
Call Trace: <ffffffff80221da2>{__might_sleep+190} <ffffffff80236033>{blocking_notifier_call_chain+31}
<ffffffff8022c29a>{do_exit+34} <ffffffff8040ee4b>{_spin_lock_irqsave+11}
<ffffffff802ee041>{vgacon_set_cursor_size+51} <ffffffff80410fdf>{do_page_fault+1852}
<ffffffff880614b7>{:ib_core:ib_ud_header_pack+135} <ffffffff880b1420>{:ib_mthca:build_mlx_header+464}
<ffffffff880732ce>{:ib_mad:timeout_sends+0} <ffffffff880732ce>{:ib_mad:timeout_sends+0}
<ffffffff8020a4f1>{error_exit+0} <ffffffff880732ce>{:ib_mad:timeout_sends+0}
<ffffffff802e0280>{spin_bug+116} <ffffffff802e026d>{spin_bug+97}
<ffffffff802e02cd>{_raw_spin_lock+28} <ffffffff8040ee4b>{_spin_lock_irqsave+11}
<ffffffff8807beff>{:ib_sa:release_group+27} <ffffffff8807c903>{:ib_sa:mcast_work_handler+1280}
<ffffffff802232bc>{find_busiest_group+304} <ffffffff880732ce>{:ib_mad:timeout_sends+0}
<ffffffff8807b8c3>{:ib_sa:ib_sa_mcmember_rec_callback+64}
<ffffffff8040eef7>{_spin_unlock_irq+7} <ffffffff8040d976>{thread_return+100}
<ffffffff8807bac4>{:ib_sa:send_handler+74} <ffffffff8807345b>{:ib_mad:timeout_sends+397}
<ffffffff80238e94>{run_workqueue+161} <ffffffff80238ede>{worker_thread+0}
<ffffffff8023be88>{keventd_create_kthread+0} <ffffffff80238fe3>{worker_thread+261}
<ffffffff80223e8f>{default_wake_function+0} <ffffffff8023be88>{keventd_create_kthread+0}
<ffffffff80223e8f>{default_wake_function+0} <ffffffff8023be88>{keventd_create_kthread+0}
<ffffffff8023be5f>{kthread+200} <ffffffff8020a6aa>{child_rip+8}
<ffffffff8023be88>{keventd_create_kthread+0} <ffffffff8023bd97>{kthread+0}
<ffffffff8020a6a2>{child_rip+0}
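One way to cross-check the first oops against the use-after-free theory is to compare the faulting address with the register dump. This sketch assumes that the Code: bytes start at the faulting RIP, which the dump does not mark explicitly. Under that assumption the first seven bytes (44 8b 83 04 01 00 00) decode as mov 0x104(%rbx),%r8d, and RBX plus that displacement equals CR2. That would fit spin_bug following a garbage pointer read out of the lock. The function names below are hypothetical helpers, and the constants are copied from the dump.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values copied from the register dump of the first oops. */
#define OOPS_RBX 0x00000001003e0003ULL
#define OOPS_CR2 0x00000001003e0107ULL

/* First instruction from the Code: line (assumed to start at RIP):
 * REX prefix 0x44, opcode 0x8b, ModRM 0x83, then a 32-bit displacement. */
const uint8_t oops_insn[7] = { 0x44, 0x8b, 0x83, 0x04, 0x01, 0x00, 0x00 };

/* Extract the little-endian 32-bit displacement that follows the
 * three prefix/opcode/ModRM bytes of this particular encoding. */
uint32_t decode_disp32(const uint8_t *insn)
{
    assert(insn != NULL);
    return (uint32_t)insn[3]
         | ((uint32_t)insn[4] << 8)
         | ((uint32_t)insn[5] << 16)
         | ((uint32_t)insn[6] << 24);
}

/* Effective address of a base-plus-displacement memory operand. */
uint64_t fault_address(uint64_t base, uint32_t disp)
{
    return base + disp;
}
```

With these inputs, `fault_address(OOPS_RBX, decode_disp32(oops_insn))` equals `OOPS_CR2`, so the fault is a load through RBX, whose value does not look like a valid kernel pointer.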
Triggering sysrq shows that this happened during module removal:
modprobe D ffff810055b1be48 0 22388 1 4048 (NOTLB)
ffff810055b1be48 0000000155b1bdc0 ffff81007a1c9880 000000000000050e
000002b14226ffda ffff81007c1888c0 0000000000000001 0000000155b1be88
ffff81007a1c9880 0000000000001707
Call Trace: <ffffffff8040dae3>{wait_for_completion+229}
<ffffffff8040daa3>{wait_for_completion+165} <ffffffff80223e8f>{default_wake_function+0}
<ffffffff80223e8f>{default_wake_function+0} <ffffffff8807cefe>{:ib_sa:mcast_cleanup+25}
<ffffffff8807cf12>{:ib_sa:ib_sa_cleanup+6} <ffffffff80241141>{sys_delete_module+411}
<ffffffff802dce63>{__up_write+20} <ffffffff8025c501>{sys_munmap+91}
<ffffffff8020961a>{system_call+126}
--
Michael S. Tsirkin - Staff Engineer, Mellanox Technologies Ltd.