[PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun

Michael S. Tsirkin mst at mellanox.co.il
Mon Dec 20 08:16:28 PST 2004


Hello!
Quoting r. Roland Dreier (roland at topspin.com) "Re: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun":
>     Michael> CQ consumer index doorbell FW is reasonably well tested
>     Michael> with VAPI (and with directed tests). It is also
>     Michael> relatively straight-forward code so I would suspect a
>     Michael> driver problem first of all.
> 
> Fair enough but the behavior changed from FW version 3.2 to 3.3.1
> which is interesting as well.

I know, but races are always tricky; it could be just a timing issue.
It's just that CI (consumer index) doorbells are routinely stressed here by QA.

>     Michael> Unfortunately once the overrun happens I cannot bring
>     Michael> the interface down nor unload the ip over ib module (both
>     Michael> commands hang) so I have to reboot. This is slowing me
>     Michael> down considerably.  Do you have an idea why is that, and
>     Michael> how to fix this problem?
> 
> Probably IPoIB is stuck in the loop
> 
> 	/* Wait for all sends and receives to complete */
> 	while (priv->tx_head != priv->tx_tail || recvs_pending(dev))
> 		yield();
> 
> in ipoib_ib_dev_stop(), since some of the completions it's waiting for are
> lost because of the CQ overrun.  I'll add a timeout here where we give
> up and assume everything is done.
> 
>  - R.

Let me know when you do. But why wait at all?
Once you close the CQ and get the command interface event
for the HW2SW_CQ command, it is guaranteed that you won't get any
new CQEs or events on this CQ.

Alternatively, you can do a query CQ to check that it's not in overrun,
although that seems like working around the specific problem we see.
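
For reference, here is a rough user-space sketch of the bounded wait
discussed above: the same drain loop as in ipoib_ib_dev_stop(), but
giving up after a fixed number of polls. Everything in it is made up
for illustration (struct ring_state, simulate_completion(),
wait_for_drain(), and the poll-count limit); it is not the IPoIB code,
and a real driver would presumably use a time-based limit instead of a
poll count.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the driver state; only the counters matter. */
struct ring_state {
	unsigned tx_head;    /* next send to post */
	unsigned tx_tail;    /* next send to complete */
	unsigned rx_pending; /* receives still posted */
	unsigned lost;       /* send completions dropped by a simulated overrun */
};

/* Simulated progress: each call reaps one completion that was not lost. */
static void simulate_completion(struct ring_state *s)
{
	unsigned tx_outstanding = s->tx_head - s->tx_tail;

	if (tx_outstanding > s->lost)
		s->tx_tail++;
	else if (s->rx_pending > 0)
		s->rx_pending--;
	/* otherwise nothing will ever complete: the rest was lost */
}

/*
 * Wait for all sends and receives to complete, but give up after
 * max_polls iterations. Returns true if everything drained, false
 * if the wait timed out with work still outstanding.
 */
static bool wait_for_drain(struct ring_state *s, unsigned max_polls)
{
	unsigned polls = 0;

	while (s->tx_head != s->tx_tail || s->rx_pending != 0) {
		if (polls++ >= max_polls)
			return false; /* assume the rest is lost */
		simulate_completion(s);
	}
	return true;
}
```

With no lost completions the loop drains and returns true; with lost
completions it returns false once the limit is hit instead of spinning
forever, which is the hang seen on ifdown and module unload.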

MST


