[ofa-general] Re: [PATCH for-2.6.21] mthca: QP reset race fixup

Tue Mar 20 22:18:03 PDT 2007

> Quoting Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH for-2.6.21] mthca: QP reset race fixup
> 
>  > This fixes openfabrics bugzilla 394:
>  > - Use common EQ for command interface and async events
>  > - Clean CQ after moving QP to reset
> 
> This is a little terse -- an ideal changelog entry would explain what
> the bug is, what is being changed to fix it, and why that fixes the issue.
> I'll try to fix it up...
> 
>  > This also fixes a potential crash in ipoib cm:
>  > - sync with completion event ISR after QP is reset
>  >   to prevent ULP from getting and using QP pointer and/or WRID
>  >   after they are freed.
> 
>  > We can rip the lines that sync with MTHCA_EQ_COMP out if you think
>  > the issue needs to be dealt with in some other way - and I agree
>  > this is only good for ULPs that do all their polling
>  > inside the ISR, but at least this covers all in-kernel code.
> 
> I don't really like this change, although maybe it's the right thing
> to do.  But can you explain what IPoIB CM is doing that would cause it
> to run into trouble?  I'd like to see if there's a better solution.
> It just seems strange to me to add the assumption that destroying a QP
> makes sure that all running CQ callbacks are done.

Look at ipoib_cm_stale_task:
+               ib_destroy_cm_id(p->id);
+               ib_destroy_qp(p->qp);

and then
+       if (!likely(wr_id & IPOIB_CM_RX_UPDATE_MASK)) {
+               p = wc->qp->qp_context;

This wc->qp->qp_context might use QP after free.

> 
> If we change to NAPI (so that CQs are polled asynchronously) does that
> readd the same bug?

Hmm. Yes.
In hindsight, it was probably better to put qp_context directly in ib_wc
instead of the qp pointer.

Then ipoib could set some flag in the structure pointed to by qp_context.

My guess this would be too big a change for 2.6.21. What do you think?

-- 
MST