[ofw] IBAL CEP reference counting is... interesting

Fri Jan 9 14:44:23 PST 2009

>From the code:
>/* Number of outstanding MADs.  Delays destruction of CEP destruction. */
>atomic32_t                                      ref_cnt;
>
>So it's not a ref count to manage user references.  All user calls perform a
>lookup based on input CID.  The MADs reference the CEP based on the context set
>in the send.

See __create_cep():

	/*
	 * Pre-charge the reference count to 1.  The code will invoke the
	 * destroy callback once the ref count reaches to zero.
	 */
	p_cep->ref_cnt = 1;

and __cleanup_cep():

	return cl_atomic_dec( &p_cep->ref_cnt );

So, it's not a MAD count either... maybe the fix is to change the name to
outstanding_mads and initialize it to 0.  I have no idea if such a simple change
will work, but I can look into it.  This should make the code cleaner, but won't
fix the hang problem.
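
To make the pre-charge behavior concrete, here is a minimal sketch of the pattern, not the IBAL code. Everything named sketch_* is hypothetical, and C11 atomics stand in for cl_atomic_dec(). The point is only that the count starts at 1, so it is always "outstanding MADs plus one" until cleanup drops the creation reference.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a CEP; not the real IBAL structure. */
typedef struct sketch_cep {
	atomic_int	ref_cnt;	/* pre-charged to 1 at creation */
	bool		destroyed;	/* set where the destroy callback would run */
} sketch_cep_t;

/* Mirrors the pre-charge quoted from __create_cep(). */
static void sketch_create( sketch_cep_t *p_cep )
{
	atomic_init( &p_cep->ref_cnt, 1 );
	p_cep->destroyed = false;
}

/* Assumed to be what a retried MAD send does: hold the CEP. */
static void sketch_ref( sketch_cep_t *p_cep )
{
	atomic_fetch_add( &p_cep->ref_cnt, 1 );
}

/*
 * Drop one reference and return the new count.  In this sketch the
 * "destroy callback" is just a flag that is set when the count hits zero.
 */
static int sketch_deref( sketch_cep_t *p_cep )
{
	int remaining = atomic_fetch_sub( &p_cep->ref_cnt, 1 ) - 1;

	if( remaining == 0 )
		p_cep->destroyed = true;
	return remaining;
}
```

With one MAD outstanding, the cleanup-time decrement returns 1 and nothing is destroyed; only the decrement that follows the MAD completing brings the count to zero. A counter renamed outstanding_mads and initialized to 0 would need a different "safe to destroy" test, which is why I'm not sure the rename is as simple as it sounds.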

>> For large clusters, the CM timeout can be huge.  My idea to fix this
>> was to have the DREQ sent once without being tied to the cep if
>> initiated from the destroy call.  Comments?
>
>Why not cancel the DREQ in __cleanup_cep if you're in the DREQ_SENT state when
>destroying?  Any MAD that is sent via __cep_send_retry (any MAD that is retried
>until the CEP manager cancels it) sets p_cep->p_send_mad.  Use that to cancel
>in __cleanup_cep.  I think this will give you the behavior you want: the DREQ
>process gets aborted if the CEP is destroyed by the app.

The __cleanup_cep() call is what sends the DREQ in the first place...  The cep
enters the function in the established state.

I was looking at changing __dreq_cep() to use __cep_send_mad(), which doesn't
take a reference on the cep.
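
Roughly, the difference I'm after looks like the sketch below. This is a toy model under my own assumptions, not the IBAL functions: the demo_* names are hypothetical, the counter is a plain int for brevity, and "send" does nothing except take or not take a reference.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a CEP; not the real IBAL structure. */
typedef struct demo_cep {
	int	ref_cnt;	/* creation reference plus one per retried MAD */
	bool	destroyed;	/* set where the destroy callback would run */
} demo_cep_t;

/* Drop one reference; "destroy" when the count reaches zero. */
static void demo_deref( demo_cep_t *p_cep )
{
	if( --p_cep->ref_cnt == 0 )
		p_cep->destroyed = true;
}

/* Models a retried send: the MAD holds the CEP until it completes. */
static void demo_send_retry( demo_cep_t *p_cep )
{
	p_cep->ref_cnt++;
}

/* Models a one-shot send: the MAD is not tied to the CEP. */
static void demo_send_once( demo_cep_t *p_cep )
{
	(void)p_cep;
}

/*
 * Models the destroy path: send the DREQ, then drop the creation
 * reference.  Returns true if the CEP was destroyed immediately.
 */
static bool demo_destroy( demo_cep_t *p_cep, bool one_shot_dreq )
{
	if( one_shot_dreq )
		demo_send_once( p_cep );
	else
		demo_send_retry( p_cep );
	demo_deref( p_cep );
	return p_cep->destroyed;
}
```

In this model, the retried DREQ keeps the CEP alive until the send completes or times out, which is where a huge CM timeout hurts. The one-shot DREQ lets destruction finish right away, at the cost of not retrying the DREQ if it is lost.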

- Sean