[ofw] CM ref counting issues...

Fab Tillier ftillier at microsoft.com
Thu Dec 17 12:53:02 PST 2009


Sean Hefty wrote on Thu, 17 Dec 2009 at 09:21:29

> What's observed is this:
> 
> __cep_mad_send_cb() was invoked for a mad with attr_id = 0x1300
> (CM_REP_ATTR_ID) with status 0xf (IB_WCS_CANCELED).  The current state
> of the cep is CEP_STATE_DREQ_SENT.  You'll need to trace through the
> call for this, but the code sees that the request was canceled, changes
> mad->status to timeout_retry, then drops to processing cep state
> CEP_STATE_DREQ_SENT.  The assumption being made is that the mad being
> processed is a timed out DREQ, so the cep is transitioned into
> CEP_STATE_TIMEWAIT.  In reality, the mad was a successfully processed
> REP, which was canceled when the RTU was received.

Ok, got it.

> Meanwhile, the real DREQ is still outstanding.  Even if a DREP is
> received, it'll be dropped because the cep is now in the wrong state, or
> could have exited timewait completely.
> 
> To fix this, before processing a completed send mad, the current state
> of the cep should be checked against the state that the cep was in when
> the mad was sent.  If those states differ, then the send completion
> should simply be discarded, as some other action is now driving the
> state machine.
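
For reference, the check proposed above could be sketched roughly as follows. This is a simplified model, not code from al_cm_cep.c: the state enum is a cut-down stand-in for the real CEP states, and `send_mad_t`, `sent_state`, `post_send()` and `completion_is_stale()` are hypothetical names invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Cut-down stand-ins for the CEP states mentioned in this thread. */
typedef enum {
    STATE_REP_SENT,
    STATE_ESTABLISHED,
    STATE_DREQ_SENT,
    STATE_TIMEWAIT
} cep_state_t;

typedef struct {
    cep_state_t state;
} cep_t;

/* Hypothetical: every send remembers the CEP state it was posted in. */
typedef struct {
    cep_state_t sent_state;
} send_mad_t;

/* Record the current CEP state when the MAD is handed to the send path. */
static void post_send(const cep_t *cep, send_mad_t *mad)
{
    mad->sent_state = cep->state;
}

/* A completion is stale if the CEP has moved on since the MAD was sent. */
static bool completion_is_stale(const cep_t *cep, const send_mad_t *mad)
{
    return cep->state != mad->sent_state;
}

/* Replays the sequence from the report: REP sent, RTU received, DREQ sent. */
static void demo_stale_rep(void)
{
    cep_t cep = { STATE_REP_SENT };
    send_mad_t rep, dreq;

    post_send(&cep, &rep);          /* REP goes out in REP_SENT */
    cep.state = STATE_ESTABLISHED;  /* RTU arrives, REP gets canceled */
    cep.state = STATE_DREQ_SENT;    /* local disconnect starts */
    post_send(&cep, &dreq);         /* DREQ goes out in DREQ_SENT */

    assert(completion_is_stale(&cep, &rep));    /* late REP completion: discard */
    assert(!completion_is_stale(&cep, &dreq));  /* DREQ completion: process */
}
```

With a check like this, the late REP completion would be dropped before the DREQ_SENT handling ever sees it, so the CEP would not be pushed into timewait by the wrong MAD.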

The only MADs that can be canceled are those that get retried: REQ, REP, LAP, and DREQ.  Of these, the only one that needs any action when it gets canceled is the DREQ, in the case where the CEP has been destroyed.
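
As a rough illustration of that rule, here is a minimal sketch that decides what to do with a canceled send based on the message type alone. The enums and `on_canceled_send()` are hypothetical names for illustration only; they are not the CM_*_ATTR_ID constants or any function from al_cm_cep.c, and the sketch treats every canceled DREQ as a timeout without modeling whether the CEP was destroyed.

```c
#include <assert.h>

/* Hypothetical message tags; not the real CM attribute ID values. */
typedef enum { MSG_REQ, MSG_REP, MSG_LAP, MSG_DREQ } cm_msg_t;

/* What the send-completion handler should do with a canceled MAD. */
typedef enum { ACTION_DISCARD, ACTION_TREAT_AS_TIMEOUT } cancel_action_t;

static cancel_action_t on_canceled_send(cm_msg_t msg)
{
    switch (msg) {
    case MSG_REQ:
    case MSG_REP:
    case MSG_LAP:
        /* Canceled because the reply arrived: nothing left to do. */
        return ACTION_DISCARD;

    case MSG_DREQ:
    default:
        /* Treat as a timeout so the state machine keeps moving. */
        return ACTION_TREAT_AS_TIMEOUT;
    }
}
```

Under this rule the canceled REP from the report would be discarded regardless of the CEP's current state, which is the case that went wrong.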

Does the following patch work for you?  It checks the attribute ID of the canceled MAD rather than the current CEP state.  I haven't tested it (or even compiled it, sorry).

Signed-off-by: Fab Tillier <ftillier at microsoft.com>

Index: al_cm_cep.c
===================================================================
--- al_cm_cep.c	(revision 2642)
+++ al_cm_cep.c	(working copy)
@@ -2239,21 +2239,20 @@ __cep_mad_send_cb(
 		break;
 
 	case IB_WCS_CANCELED:
-		if( p_cep->state != CEP_STATE_REQ_SENT &&
-			p_cep->state != CEP_STATE_REQ_MRA_RCVD &&
-			p_cep->state != CEP_STATE_REP_SENT &&
-			p_cep->state != CEP_STATE_REP_MRA_RCVD &&
-			p_cep->state != CEP_STATE_LAP_SENT &&
-			p_cep->state != CEP_STATE_LAP_MRA_RCVD &&
-			p_cep->state != CEP_STATE_DREQ_SENT &&
-			p_cep->state != CEP_STATE_SREQ_SENT )
-		{
+		switch( p_mad->p_mad_buf->attr_id )
+		{
+		case CM_REQ_ATTR_ID:
+		case CM_REP_ATTR_ID:
+		case CM_LAP_ATTR_ID:
 			KeReleaseInStackQueuedSpinLockFromDpcLevel( &hdl );
 			ib_put_mad( p_mad );
-			break;
+			goto done;
+
+		default:
+			CL_ASSERT( p_mad->p_mad_buf->attr_id == CM_DREQ_ATTR_ID );
+			/* Treat as a timeout so we don't stall the state machine. */
+			p_mad->status = IB_WCS_TIMEOUT_RETRY_ERR;
 		}
-		/* Treat as a timeout so we don't stall the state machine. */
-		p_mad->status = IB_WCS_TIMEOUT_RETRY_ERR;
 
 		/* Fall through. */
 	case IB_WCS_TIMEOUT_RETRY_ERR:


