[ofw] CM ref counting issues...

Fab Tillier ftillier at microsoft.com
Wed Dec 16 21:46:05 PST 2009


Sean Hefty wrote on Wed, 16 Dec 2009 at 21:38:30

>> This is showing that DREQs were sent for CEPs.  The CEPs were
>> 'destroyed', but the DREQs were not canceled.  So, destruction of the
>> CEPs hung until the DREQs timed out.
> 
> I was finally able to track down what appears to be the root of the
> problems. The __cep_mad_send_cb() handling of failed requests is
> incorrect.  The callback assumes that the mad being processed is
> associated with the current state of the CEP, which may not be the case.
> 
> For example, in my testing, a REP mad was completed as canceled;

A REP?  If a REP times out, why aren't you ending up sending a REJ and aborting the connection?

> however, by the time the callback was invoked, the connection had
> already transitioned to the DREQ_SENT state.  The result was that the
> state machine transitioned into timewait, leaving the DREQ outstanding.

Not sure I quite follow here... The DREQ_SENT state should have invoked the callback.

> The implementation of the state machine in the CM code is subtle, but I
> think the following untested change could fix the issue in all cases:
> 
> 	KeAcquireInStackQueuedSpinLockAtDpcLevel( &gp_cep_mgr->lock, &hdl );
> 	/* Clear the sent MAD pointer so that we don't try cancelling again. */
> 	if( p_cep->p_send_mad == p_mad )
> 		p_cep->p_send_mad = NULL;

If you're going to put curlies around the else, please put them on the if too.  Makes the code easier to read, and consistent with the style of the rest of the codebase.

> +	else
> +	{
> +		KeReleaseInStackQueuedSpinLockFromDpcLevel( &hdl );
> +		ib_put_mad( p_mad );
> +	}

Are you going to skip the switch statement on the MAD status then?  If so, don't forget to release the reference on the CEP held by the MAD.  Seems like you're missing a 'goto done;' here.

> If this works, it seems like the simplest fix.  The linux CM handled
> this case by recording the state that a mad was sent in, then checked
> that the cep was in the same state when it completed.  If the states
> differed, then the mad was discarded.  This is another possibility if
> the above doesn't work.
> 
> - Sean





More information about the ofw mailing list