[ofa-general] IPoIB caused a kernel: BUG: soft lockup detected on CPU#0!

Roland Dreier rdreier at cisco.com
Wed Feb 28 08:42:41 PST 2007


I guess the solution is to merge IPoIB NAPI to avoid overloading the
system with interrupts.  I'll fix up a few last things with my NAPI
patch and we can try to get it in shape to merge for 2.6.22.
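
For readers unfamiliar with the idea, here is a rough userspace sketch of
the NAPI-style pattern: the interrupt handler only masks further
interrupts, and a budgeted poll routine drains completions, re-enabling
the interrupt only once the queue is empty. This is not the actual IPoIB
NAPI patch; struct sim_cq, sim_poll_cq(), sim_irq() and sim_napi_poll()
are hypothetical names made up for illustration, and NUM_WC merely stands
in for IPOIB_NUM_WC.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_WC 4   /* batch size per poll call; stands in for IPOIB_NUM_WC */

/* Hypothetical stand-in for a completion queue (not a real verbs type). */
struct sim_cq {
	int  pending;      /* completions waiting to be processed */
	bool irq_enabled;  /* may the CQ raise a completion interrupt? */
	int  handled;      /* total completions processed so far */
};

/* Remove up to max entries and return how many were taken. */
static int sim_poll_cq(struct sim_cq *cq, int max)
{
	int n = cq->pending < max ? cq->pending : max;

	cq->pending -= n;
	return n;
}

/* "Interrupt handler": do no real work, just mask further interrupts. */
static void sim_irq(struct sim_cq *cq)
{
	cq->irq_enabled = false;
}

/*
 * Budgeted poll: process at most budget completions per call.
 * The interrupt is re-enabled only when a poll comes back short,
 * i.e. the queue was drained before the budget ran out.
 */
static int sim_napi_poll(struct sim_cq *cq, int budget)
{
	int done = 0;

	assert(budget > 0);
	while (done < budget) {
		int want = budget - done < NUM_WC ? budget - done : NUM_WC;
		int n = sim_poll_cq(cq, want);

		done += n;
		cq->handled += n;
		if (n < want) {
			cq->irq_enabled = true;  /* drained: back to interrupts */
			break;
		}
	}
	return done;
}
```

The point of the design is that under heavy load the poll routine keeps
returning a full budget and the interrupt stays masked, so the interrupt
rate drops as the load goes up instead of rising with it.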

 > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
 > index f2aa923..97ea26f 100644
 > --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
 > +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
 > @@ -301,6 +301,7 @@ void ipoib_ib_completion(struct ib_cq *c
 >  		n = ib_poll_cq(cq, IPOIB_NUM_WC, priv->ibwc);
 >  		for (i = 0; i < n; ++i)
 >  			ipoib_ib_handle_wc(dev, priv->ibwc + i);
 > +		cond_resched();

Obviously this is wrong, because ipoib_ib_completion() is not
necessarily called in process context (in fact, ehca with its scaling
hack is probably the only driver that calls it where it's safe to
reschedule).

 >  	} while (n == IPOIB_NUM_WC);
 >  }
 > 
 > However I still saw that BUG trace occurred on 3-4 cpus after several hrs. 

Right, because this patch is not really doing anything to reduce the
interrupt load.


