[ofa-general] IPoIB caused a kernel: BUG: soft lockup detected on CPU#0!

Hoang-Nam Nguyen hnguyen at linux.vnet.ibm.com
Wed Feb 28 04:50:03 PST 2007


Hi,
I have also seen this under heavy bidirectional traffic between two nodes
over 4 links (ppc64, ehca on 2.6.20) through IPoIB. Here is a snippet of
the backtraces:

BUG: soft lockup detected on CPU#23!
Call Trace:
[C00000000F5DB470] [C00000000000FC8C] .show_stack+0x5c/0x1cc (unreliable)
[C00000000F5DB520] [C00000000008731C] .softlockup_tick+0x114/0x14c
[C00000000F5DB5E0] [C000000000063210] .run_local_timers+0x1c/0x30
[C00000000F5DB660] [C000000000024244] .timer_interrupt+0xec/0x504
[C00000000F5DB750] [C000000000003570] decrementer_common+0xf0/0x100
--- Exception: 901 at .tcp_v4_rcv+0x964/0xd04
    LR = .tcp_v4_rcv+0x938/0xd04
[C00000000F5DBB30] [C00000000035A328] .ip_local_deliver+0x1ac/0x400
[C00000000F5DBBC0] [C000000000359B04] .ip_rcv+0x378/0x690
[C00000000F5DBC70] [C00000000032D5EC] .netif_receive_skb+0x550/0x574
[C00000000F5DBD20] [C00000000032D718] .process_backlog+0x108/0x250
[C00000000F5DBE00] [C00000000032B434] .net_rx_action+0x198/0x2f4
[C00000000F5DBED0] [C00000000005CB58] .__do_softirq+0xd8/0x1a0
[C00000000F5DBF90] [C00000000002761C] .call_do_softirq+0x14/0x24
[C0000003B4E23BA0] [C00000000000CE68] .do_softirq+0xb4/0xc0
[C0000003B4E23C30] [C00000000032DC78] .netif_rx_ni+0x58/0x78
[C0000003B4E23CB0] [D00000000013F638] .ipoib_ib_completion+0x2a4/0x6dc [ib_ipoib]
[C0000003B4E23DB0] [D00000000069EB94] .comp_task+0x340/0x424 [ib_ehca]
[C0000003B4E23ED0] [C00000000007338C] .kthread+0x170/0x1c0
[C0000003B4E23F90] [C0000000000277D8] .kernel_thread+0x4c/0x68

The above trace occurred multiple times on all 32 CPUs.
The reason is that the softlockup watchdog thread did not get the CPU for
more than 10 seconds (see kernel/softlockup.c), apparently because
ipoib_ib_completion() kept polling the CQ at a high rate. The following
patch should help:

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index f2aa923..97ea26f 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -301,6 +301,7 @@ void ipoib_ib_completion(struct ib_cq *c
 		n = ib_poll_cq(cq, IPOIB_NUM_WC, priv->ibwc);
 		for (i = 0; i < n; ++i)
 			ipoib_ib_handle_wc(dev, priv->ibwc + i);
+		cond_resched();
 	} while (n == IPOIB_NUM_WC);
 }

However, even with the patch I still saw the BUG trace on 3-4 CPUs after
several hours. I should also mention that the systems remained functional.
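
To illustrate the mechanism, here is a minimal userspace sketch of the
check and of the polling loop. It is only a model under stated assumptions,
not kernel code: drain_cq(), MODEL_HZ and the tick accounting are
hypothetical names invented for the illustration, the "yield" stands in for
cond_resched(), and it assumes the watchdog thread always gets to run when
the loop yields (the real cond_resched() only reschedules when a reschedule
is pending).

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_HZ     250               /* assumed tick rate, illustration only */
#define LOCKUP_TICKS (10 * MODEL_HZ)   /* 10 seconds expressed in ticks */

/*
 * Model of a completion handler that drains `pending` work completions in
 * batches of `batch`, looping for as long as the previous poll returned a
 * full batch (the same loop shape as in the patch above).
 *
 * Each batch is assumed to cost `ticks_per_batch` timer ticks. On every
 * tick the model checks whether the watchdog timestamp is older than
 * LOCKUP_TICKS. If `yield_between_batches` is set, the watchdog is assumed
 * to run after each batch and refresh its timestamp.
 *
 * Returns true if a stall longer than 10 seconds would have been flagged.
 */
static bool drain_cq(unsigned long pending, unsigned long batch,
                     unsigned int ticks_per_batch, bool yield_between_batches)
{
    unsigned long jiffies = 0;   /* modelled tick counter */
    unsigned long touched = 0;   /* last time the watchdog thread ran */
    unsigned long n;
    bool lockup = false;

    assert(batch > 0);           /* batch == 0 would never terminate */

    do {
        /* Stand-in for polling the CQ: take at most one batch. */
        n = pending < batch ? pending : batch;
        pending -= n;

        /* Time passes while the batch is handled; the tick checks. */
        for (unsigned int t = 0; t < ticks_per_batch; ++t) {
            ++jiffies;
            if (jiffies - touched > LOCKUP_TICKS)
                lockup = true;
        }

        /* Stand-in for cond_resched(): let the watchdog thread run. */
        if (yield_between_batches)
            touched = jiffies;
    } while (n == batch);

    return lockup;
}
```

In this model, drain_cq(20000, 4, 1, false) flags a stall, because the loop
runs for about 5000 ticks without ever letting the watchdog run, while
drain_cq(20000, 4, 1, true) does not. It says nothing about why the trace
still shows up occasionally with the patch applied.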

Regards
Nam




