[PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun

Roland Dreier roland at topspin.com
Mon Dec 20 08:41:17 PST 2004


    Michael> I know but races are always tricky, could be just a
    Michael> timing issue.  It's just that CI doorbells are routinely
    Michael> stressed here by QA.

The thing that really makes it hard for me to think of a potential
driver problem is that changing from updating the CI all at once to
updating it by 1 at a time in a loop fixes things for me.  If anything
this lengthens the amount of time during which the CQ has too little
space.
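
To make the comparison concrete, here is a toy model of the two update
styles. It is only a sketch: the struct and function names are made up
for illustration and are not the driver's actual code. It shows that
both styles leave the consumer index in the same final state, so the
only difference the hardware could see is the intermediate values.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model only: not driver code. All names here are hypothetical. */
struct toy_cq {
	uint32_t size;     /* number of CQ entries */
	uint32_t prod;     /* entries written by the "hardware" so far */
	uint32_t cons_hw;  /* consumer index last reported to the "hardware" */
};

/* Free entries from the hardware's point of view. */
static uint32_t toy_cq_free(const struct toy_cq *cq)
{
	uint32_t used = cq->prod - cq->cons_hw;

	assert(used <= cq->size);
	return cq->size - used;
}

/* Report n consumed entries with a single consumer-index update. */
static void toy_update_ci_batch(struct toy_cq *cq, uint32_t n)
{
	cq->cons_hw += n;
}

/* Report n consumed entries one at a time, as in the loop variant. */
static void toy_update_ci_single(struct toy_cq *cq, uint32_t n)
{
	uint32_t i;

	for (i = 0; i < n; ++i)
		cq->cons_hw += 1;
}
```

In this model the loop variant only exposes free space more gradually;
it never ends up with more space than the batch update does.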

Also, adding 1000 extra entries to the CQ created by IPoIB -- i.e.
changing the code in ipoib_verbs.c to

	priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev,
				IPOIB_TX_RING_SIZE + IPOIB_RX_RING_SIZE + 1000);

still has the same problem, so we're not just transiently overrunning
by 1 or something like that -- it looks like we're systematically
losing updates to the CQ.

    Michael> Let me know when you do. But why wait?  Once you close
    Michael> the CQ, and get the command interface event of the hw2sw
    Michael> cq, it is guaranteed you won't get any new cqes or events
    Michael> on this cq.

OK, it's done.  The reason for the wait here is that we are actually
cleaning up the QP and want to make sure that we don't leak any
resources.  First we transition the QP to error, wait for all work
requests to complete, and then transition the QP to reset.
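
The ordering can be sketched as a small state machine. This is a toy
model with hypothetical names, not the IPoIB code; in the real driver
the transitions would be made through the verbs layer and the wait
would be driven by polling the CQ. The count of outstanding work
requests here is an assumed bookkeeping field.

```c
#include <assert.h>

/* Toy model only: names are hypothetical, not the IPoIB driver's. */
enum toy_qp_state { TOY_QPS_RTS, TOY_QPS_ERR, TOY_QPS_RESET };

struct toy_qp {
	enum toy_qp_state state;
	int outstanding;   /* posted work requests not yet completed */
};

/* Stand-in for polling the CQ: reap one completion if any is pending. */
static int toy_poll_one(struct toy_qp *qp)
{
	if (qp->outstanding == 0)
		return 0;
	qp->outstanding--;
	return 1;
}

/* Tear down in the order described; returns how many requests drained. */
static int toy_qp_teardown(struct toy_qp *qp)
{
	int drained = 0;

	qp->state = TOY_QPS_ERR;         /* 1. move the QP to error */
	while (qp->outstanding > 0)      /* 2. wait for all work requests */
		drained += toy_poll_one(qp);
	assert(qp->outstanding == 0);    /* nothing may be left to leak */
	qp->state = TOY_QPS_RESET;       /* 3. only now move to reset */
	return drained;
}
```

The point of the middle step is that every posted request gets a
completion before the reset, so each buffer can be freed exactly once.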

 - Roland


