[ofw] CM ref counting issues...

Fab Tillier ftillier at microsoft.com
Thu Dec 17 14:10:10 PST 2009


Hefty, Sean wrote on Thu, 17 Dec 2009 at 13:53:16

>> +        default:
>> +            CL_ASSERT( p_mad->p_mad_buf->attr_id == CM_DREQ_ATTR_ID );
>> +            /* Treat as a timeout so we don't stall the state machine. */
>> +            p_mad->status => IB_WCS_TIMEOUT_RETRY_ERR;
> 
> I'm wondering about this and if it's even needed.  Changing the status
> isn't intuitive to someone reading the code, plus it hindered trying to
> debug the problem because the MAD was reported to the user with a
> timeout error, rather than canceled.  A canceled MAD indicates normal
> operation for the CM.

The idea is that a canceled DREQ should be handled identically to one that timed out.  The code delays moving the CEP into timewait until the DREQ is complete one way or another.  When you destroy the CEP, if you didn't send the DREQ, you send it.  If the DREQ was sent, you cancel it.  Once the DREQ completes, the CEP enters timewait.  You might be able to just move to timewait in __cleanup_cep and eliminate the DREQ_DESTROY state altogether.
 
> The mad status isn't used further down in the function to drive the
> state machine.  The states are driven solely based on the current state
> and the fact that a mad has failed.  I'm going to try removing this, so
> if you're aware of something that will immediately break, let me know.

Yes, removing the status assignment should be fine.  But I think entering timewait as soon as the CEP is destroyed might be cleaner, as it allows all canceled MADs to be treated the same way.

-Fab



More information about the ofw mailing list