[ofa-general] [Bug, PATCH and another Bug] Was: Fix refcounting problem with netif_rx_reschedule()

Wed Sep 19 04:54:03 PDT 2007

Hi Dave,

After applying Roland's NAPI patch, system panics when I run multiple
thread iperf (no stack trace at this time, it shows that the panic is in
net_tx_action).

I think the problem is:

In the "done < budget" case, ipoib_poll calls netif_rx_complete()
netif_rx_complete()
	__netif_rx_complete()
		__napi_complete()
			list_del()
				__list_del()
				entry->next = LIST_POISON1;
				entry->prev = LIST_POISON2;

Due to race with another completion (explained at end of the patch),
net_rx_action finds quota==0 (even though done < budget earlier):
net_rx_action()
	if (unlikely(!n->quota)) {
		n->quota = n->weight;
		list_move_tail()
			__list_del(POISON, POISON)
			<PANIC>
	}

while IPoIB calling netif_rx_reschedule() works fine due to:
netif_rx_reschedule
        __netif_rx_schedule
                __napi_schedule
                        list_add_tail (this is OK)

Patch that fixes this:

diff -ruNp a/include/linux/netdevice.h b/include/linux/netdevice.h

--- a/include/linux/netdevice.h	2007-09-19 16:50:35.000000000 +0530
+++ b/include/linux/netdevice.h	2007-09-19 16:51:28.000000000 +0530
@@ -346,7 +346,7 @@ static inline void napi_schedule(struct 
 static inline void __napi_complete(struct napi_struct *n)
 {
 	BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
-	list_del(&n->poll_list);
+	__list_del(&n->poll_list);
 	smp_mb__before_clear_bit();
 	clear_bit(NAPI_STATE_SCHED, &n->state);
 }

When I apply this patch, things work fine but I get napi_check_quota_bug()
warning. This race seems to happen as follows:

CPU#1: ipoib_poll(budget=100)
{
	A. process 100 skbs
	B. netif_rx_complete()
	<Process unrelated interrupts; executes slower than steps C, D, E on
	 CPU#2>
	F. ib_req_notify_cq() (no missed completions, do nothing)
	G. return 100
	H. return to net_rx_action, quota=99, subtract 100, quota=-1, BUG.
}

CPU#2: ipoib_ib_completion() : (starts and finishes entire line of execution
				*after* step B and *before* H executes).
{
	C. New skb comes, call netif_rx_schedule; set quota=100
	D. do ipoib_poll(), process one skb, return work=1 to net_rx_action
	E. net_rx_action: set quota=99
}

The reason why both cpu's can execute poll simultaneously is because
netpoll_poll_lock() returns NULL (dev->npinfo == NULL). This results in
negative napi refcount and the warning. I verified this is the reason by
saving the original quota before calling poll (in net_tx_action) and
comparing with final after poll (before it gets updated), and it gets
changed very often in multiple thread testing (atleast 4 threads, haven't
seen with 2). In most cases, the quota becomes -1, and I have seen upto
-9 but those are rarer.

Note: during steps F-H and C-E, priv/napi is read/modified by both cpu's
	which is another bug relating to the same race.

I guess the above patch is not required if this bug (in IPoIB) is fixed?

Roland, why cannot we get rid of "poll_more" ? We will get called again
after netif_rx_reschedule, and it is cleaner to let the new execution handle
fresh completions. Is there a reason why this goto is required?

Thanks,

- KK