[ofw] CM ref counting issues...

Sean Hefty sean.hefty at intel.com
Thu Dec 17 09:21:29 PST 2009


>Ahh, ok.  I don't think we get the communication established event in this case
>since that should only happen while we're in RTR, not RTS, and the QP is
>transitioned to RTS right away, isn't it?  Or do you delay the RTS transition
>until the RTU is received in WinVerbs?

WinVerbs transitions the QP to RTS before sending the REP.  This way the app can
respond immediately to a received message.

>I think we need to have a better understanding of what's going on.  We're
>getting closer, but not quite there yet (at least I don't fully understand
>yet.)

The basic problem is that __cep_mad_send_cb() assumes that the mad being
processed is associated with the *current* state of the CEP.

What's observed is this:

__cep_mad_send_cb() was invoked for a mad with attr_id = 0x1300 (CM_REP_ATTR_ID)
and status 0xf (IB_WCS_CANCELED).  The current state of the cep is
CEP_STATE_DREQ_SENT.  You'll need to trace through the call to confirm this, but
the code sees that the request was canceled, changes mad->status to
timeout_retry, then drops into the processing for cep state CEP_STATE_DREQ_SENT.
That processing assumes the mad is a timed-out DREQ, so the cep is transitioned
into CEP_STATE_TIMEWAIT.  In reality, the mad was a REP that had been processed
successfully and was canceled when the RTU was received.

Meanwhile, the real DREQ is still outstanding.  Even if a DREP is received,
it'll be dropped because the cep is now in the wrong state, or may already have
exited timewait completely.
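
To make the sequence concrete, here is a minimal sketch of the failure.  It is
a toy model, not the actual CEP code: the types, the WCS_* and ATTR_* names, and
the naive_send_cb() / naive_recv_drep() helpers are all made up for
illustration.  The only thing it tries to capture is that the callback
dispatches on the current cep state and never looks at which mad completed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the real CEP structures. */
typedef enum {
    CEP_STATE_ESTABLISHED,
    CEP_STATE_DREQ_SENT,
    CEP_STATE_TIMEWAIT
} cep_state_t;

typedef enum { WCS_SUCCESS, WCS_CANCELED, WCS_TIMEOUT_RETRY } wc_status_t;
typedef enum { ATTR_REP, ATTR_DREQ } mad_attr_t;

typedef struct { cep_state_t state; } cep_t;
typedef struct { mad_attr_t attr; wc_status_t status; } mad_t;

/* Models the flawed assumption: the completion is interpreted purely
 * through the CURRENT cep state; mad->attr is never consulted. */
static void naive_send_cb(cep_t *cep, mad_t *mad)
{
    assert(cep != NULL && mad != NULL);

    /* A canceled send is treated like a timeout. */
    if (mad->status == WCS_CANCELED)
        mad->status = WCS_TIMEOUT_RETRY;

    switch (cep->state) {
    case CEP_STATE_DREQ_SENT:
        /* Assumes the mad must be a timed-out DREQ. */
        if (mad->status == WCS_TIMEOUT_RETRY)
            cep->state = CEP_STATE_TIMEWAIT;
        break;
    default:
        break;
    }
}

/* A DREP is only accepted while a DREQ is outstanding.
 * Returns 1 if the DREP was accepted, 0 if it was dropped. */
static int naive_recv_drep(cep_t *cep)
{
    if (cep->state != CEP_STATE_DREQ_SENT)
        return 0;
    cep->state = CEP_STATE_TIMEWAIT;
    return 1;
}
```

Feeding this a canceled REP completion while the cep is in
CEP_STATE_DREQ_SENT moves the cep to CEP_STATE_TIMEWAIT, and a DREP that
arrives afterwards is dropped.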

To fix this, before processing a completed send mad, the current state of the
cep should be checked against the state that the cep was in when the mad was
sent.  If those states differ, then the send completion should simply be
discarded, as some other action is now driving the state machine.
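
A rough sketch of that check, again as a toy model rather than a patch against
the real code.  The sent_state field, record_send(), and checked_send_cb() are
assumptions of mine; the actual structures and where the state would be stored
may well look different.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the real CEP structures. */
typedef enum {
    CEP_STATE_REP_SENT,
    CEP_STATE_ESTABLISHED,
    CEP_STATE_DREQ_SENT,
    CEP_STATE_TIMEWAIT
} cep_state_t;

typedef enum { WCS_SUCCESS, WCS_CANCELED } wc_status_t;

typedef struct { cep_state_t state; } cep_t;

typedef struct {
    wc_status_t status;
    cep_state_t sent_state;   /* assumed new field: cep state at send time */
} mad_t;

/* Remember which cep state the mad was sent in. */
static void record_send(const cep_t *cep, mad_t *mad)
{
    assert(cep != NULL && mad != NULL);
    mad->status = WCS_SUCCESS;
    mad->sent_state = cep->state;
}

/* Returns 1 if the completion was processed, 0 if it was discarded
 * because the cep has moved on since the mad was sent. */
static int checked_send_cb(cep_t *cep, const mad_t *mad)
{
    assert(cep != NULL && mad != NULL);

    /* Stale completion: something else now drives the state machine. */
    if (mad->sent_state != cep->state)
        return 0;

    /* Only a mad sent in DREQ_SENT can time the DREQ out.  As in the
     * existing code, a canceled send is treated like a timeout. */
    if (cep->state == CEP_STATE_DREQ_SENT && mad->status != WCS_SUCCESS)
        cep->state = CEP_STATE_TIMEWAIT;

    return 1;
}
```

With this in place, the canceled REP completion from the trace above is
discarded because it was sent in a different state, the cep stays in
CEP_STATE_DREQ_SENT, and a later DREP or a real DREQ timeout can still be
handled.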

- Sean



