[ofa-general] IPoIB caused a kernel: BUG: soft lockup detected on CPU#0!

Tziporet Koren tziporet at mellanox.co.il
Wed Feb 28 02:01:32 PST 2007


Hi Roland,

While running stress tests over IPoIB CM on kernel 2.6.20, the kernel
reported a soft lockup:

Feb 27 17:47:52 sw169 kernel: BUG: soft lockup detected on CPU#0!
Feb 27 17:47:52 sw169 kernel:
Feb 27 17:47:52 sw169 kernel: Call Trace:
Feb 27 17:47:52 sw169 kernel:  <IRQ>  [<ffffffff80252eef>] softlockup_tick+0xd2/0xe4
Feb 27 17:47:52 sw169 kernel:  [<ffffffff802399a4>] update_process_times+0x42/0x68
Feb 27 17:47:52 sw169 kernel:  [<ffffffff80218260>] smp_local_timer_interrupt+0x31/0x52
Feb 27 17:47:52 sw169 kernel:  [<ffffffff802182d0>] smp_apic_timer_interrupt+0x4f/0x66
Feb 27 17:47:52 sw169 kernel:  [<ffffffff8020a166>] apic_timer_interrupt+0x66/0x70
Feb 27 17:47:52 sw169 kernel:  [<ffffffff8053aaf1>] _spin_lock_irqsave+0x15/0x24
Feb 27 17:47:52 sw169 kernel:  [<ffffffff88067a23>] :ib_ipoib:ipoib_neigh_destructor+0xc2/0x139
Feb 27 17:47:52 sw169 kernel:  [<ffffffff804c216b>] neigh_destroy+0xc2/0x10e
Feb 27 17:47:52 sw169 kernel:  [<ffffffff804c198c>] dst_destroy+0x5f/0xd6
Feb 27 17:47:52 sw169 kernel:  [<ffffffff804c1a6f>] dst_run_gc+0x6c/0x12a
Feb 27 17:47:52 sw169 kernel:  [<ffffffff804c1a03>] dst_run_gc+0x0/0x12a
Feb 27 17:47:52 sw169 kernel:  [<ffffffff802398fe>] run_timer_softirq+0x14f/0x1a0
Feb 27 17:47:52 sw169 kernel:  [<ffffffff80235d89>] __do_softirq+0x50/0xbb
Feb 27 17:47:52 sw169 kernel:  [<ffffffff8020a6bc>] call_softirq+0x1c/0x28
Feb 27 17:47:52 sw169 kernel:  [<ffffffff8020bb0f>] do_softirq+0x2e/0x97
Feb 27 17:47:52 sw169 kernel:  [<ffffffff802182d5>] smp_apic_timer_interrupt+0x54/0x66
Feb 27 17:47:52 sw169 kernel:  [<ffffffff80207e66>] mwait_idle+0x0/0x42
Feb 27 17:47:52 sw169 kernel:  [<ffffffff8020a166>] apic_timer_interrupt+0x66/0x70
Feb 27 17:47:52 sw169 kernel:  <EOI>  [<ffffffff80207ea5>] mwait_idle+0x3f/0x42
Feb 27 17:47:52 sw169 kernel:  [<ffffffff80207e01>] cpu_idle+0x8b/0xae
Feb 27 17:47:52 sw169 kernel:  [<ffffffff80744709>] start_kernel+0x212/0x214
Feb 27 17:47:52 sw169 kernel:  [<ffffffff80744175>] _sinittext+0x175/0x179


To reproduce:
You need two machines (A and B) connected back to back, with opensm
installed on machine B.

On machine A, run:
ping B (using B's ib0 address)

On machine B:
Copy the scripts from http://www.openfabrics.org/~tziporet/ipoib_scripts/ to a
local directory and edit them to include the correct ib0 IP address of machine A.

Run: runscripts.sh 

Tziporet



