[PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun

Michael S. Tsirkin mst at mellanox.co.il
Mon Dec 20 08:46:01 PST 2004


Hello!
Quoting r. Roland Dreier (roland at topspin.com) "Re: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun":
>     Michael> I know but races are always tricky, could be just a
>     Michael> timing issue.  It's just that CI doorbells are routinely
>     Michael> stressed here by QA.
> 
> The thing that really makes it hard for me to think of a potential
> driver problem is that changing from updating the CI all at once to
> updating it by 1 at a time in a loop fixes things for me.  If anything
> this lengthens the amount of time during which the CQ has too little
> space.
> 
> Also adding 1000 extra entries to the CQ created by IPoIB -- i.e.
> changing the code in ipoib_verbs.c to
> 
> 	priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev,
> 				IPOIB_TX_RING_SIZE + IPOIB_RX_RING_SIZE + 1000);
> 
> still has the same problem, so we're not just transiently overrunning
> by 1 or something like that -- it looks like we're systematically
> losing updates to the CQ.
> 
>     Michael> Let me know when you do. But why wait?  Once you close
>     Michael> the CQ, and get the command interface event of the hw2sw
>     Michael> cq, it is guaranteed you won't get any new cqes or events
>     Michael> on this cq.
> 
> OK, it's done.  The reason for the wait here is that we are actually
> cleaning up the QP and want to make sure that we don't leak any
> resources.  First we transition the QP to error, wait for all work
> requests to complete, and then transition the QP to reset.
> 
>  - Roland

But why wait for completions? Once the QP is in the error state, no
new WQEs will be processed by the hardware. You can close the CQ and
free all the outstanding WQEs.
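
The teardown order suggested here can be sketched as a toy model in
plain C. This is only an illustration under stated assumptions: every
name below (toy_qp, toy_cq, toy_post, toy_teardown) is hypothetical,
nothing here is the mthca or verbs API, and closing the CQ is reduced
to clearing a flag that stands in for the hw2sw cq command completing.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical toy model; none of these names come from a real driver. */
enum toy_qp_state { TOY_QP_RTS, TOY_QP_ERR, TOY_QP_RESET };

#define TOY_RING_SIZE 8

struct toy_qp {
	enum toy_qp_state state;
	void *wqe_buf[TOY_RING_SIZE];	/* buffers owned by posted WQEs */
};

struct toy_cq {
	int open;	/* 1 while "hardware" may still write CQEs */
};

/* Post one buffer. Returns 0 on success, -1 if the QP is not accepting
 * work, the ring is full, or allocation fails. */
int toy_post(struct toy_qp *qp, size_t len)
{
	int i;

	if (qp->state != TOY_QP_RTS)
		return -1;
	for (i = 0; i < TOY_RING_SIZE; i++) {
		if (!qp->wqe_buf[i]) {
			qp->wqe_buf[i] = malloc(len);
			return qp->wqe_buf[i] ? 0 : -1;
		}
	}
	return -1;
}

/* Tear down without waiting for completions.
 * Returns the number of outstanding buffers that were freed. */
int toy_teardown(struct toy_qp *qp, struct toy_cq *cq)
{
	int i, freed = 0;

	assert(qp && cq);

	qp->state = TOY_QP_ERR;	/* assumed: no new WQEs processed from here */
	cq->open = 0;		/* stands in for the hw2sw cq command completing */

	/* With the CQ closed, nothing is polled; the buffers are freed
	 * directly from the ring instead of one per completion. */
	for (i = 0; i < TOY_RING_SIZE; i++) {
		if (qp->wqe_buf[i]) {
			free(qp->wqe_buf[i]);
			qp->wqe_buf[i] = NULL;
			freed++;
		}
	}

	qp->state = TOY_QP_RESET;
	return freed;
}
```

The point of the sketch is only the ordering: error state first, then
the CQ is closed, then the ring is walked and freed, with no polling
loop in between. Whether a real driver can safely skip the wait is
exactly the question under discussion.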

MST


