[ofw] IBAL CEP reference counting is... interesting
Sean Hefty
sean.hefty at intel.com
Fri Jan 9 14:44:23 PST 2009
>From the code:
>/* Number of outstanding MADs. Delays destruction of CEP destruction. */
>atomic32_t ref_cnt;
>
>So it's not a ref count to manage user references. All user calls perform a
>lookup based on input CID. The MADs reference the CEP based on the context set
>in the send.
See __create_cep():
/*
* Pre-charge the reference count to 1. The code will invoke the
* destroy callback once the ref count reaches to zero.
*/
p_cep->ref_cnt = 1;
and __cleanup_cep():
return cl_atomic_dec( &p_cep->ref_cnt );
So, it's not a MAD count either... Maybe the fix is to rename it to
outstanding_mads and initialize it to 0. I have no idea if such a simple change
will work, but I can look into it. This should make the code cleaner, but it
won't fix the hang problem.
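
To make the current behavior concrete, here is a minimal standalone sketch of a pre-charged count that is also bumped per outstanding MAD. This is not the IBAL code: mock_cep_t and every mock_* function are hypothetical names, and C11 atomics stand in for cl_atomic_dec(). It only shows why the destroy step is delayed while a MAD is still in flight.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the CEP; not the real IBAL structure. */
typedef struct mock_cep {
    atomic_int ref_cnt; /* pre-charged to 1, plus one per in-flight MAD */
    bool destroyed;     /* set where the destroy callback would run */
} mock_cep_t;

static void mock_cep_init(mock_cep_t *cep)
{
    atomic_init(&cep->ref_cnt, 1); /* mirrors the pre-charge in __create_cep() */
    cep->destroyed = false;
}

/* Drop one reference; "destroy" only when the count reaches zero. */
static void mock_cep_deref(mock_cep_t *cep)
{
    int prev = atomic_fetch_sub(&cep->ref_cnt, 1);
    assert(prev > 0); /* a release without a matching reference is a bug */
    if (prev == 1)
        cep->destroyed = true;
}

/* Assumption: each outstanding MAD holds one reference until it completes. */
static void mock_mad_send(mock_cep_t *cep)     { atomic_fetch_add(&cep->ref_cnt, 1); }
static void mock_mad_complete(mock_cep_t *cep) { mock_cep_deref(cep); }

/* Cleanup drops the pre-charged reference, like the decrement in __cleanup_cep(). */
static void mock_cep_cleanup(mock_cep_t *cep)  { mock_cep_deref(cep); }

/* Returns 1 if destruction was held off until the MAD completed. */
static int demo_destroy_waits_for_mad(void)
{
    mock_cep_t cep;
    mock_cep_init(&cep);
    mock_mad_send(&cep);              /* count: 2 */
    mock_cep_cleanup(&cep);           /* count: 1, not destroyed yet */
    bool delayed = !cep.destroyed;
    mock_mad_complete(&cep);          /* count: 0, destroy runs now */
    return delayed && cep.destroyed;
}
```

With this shape the field is neither a pure user ref count nor a pure MAD count: it is the MAD count plus one until cleanup runs.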
>> For large clusters, the CM timeout can be huge. My idea to fix this
>> was to have the DREQ sent once without being tied to the cep if
>> initiated from the destroy call. Comments?
>
>Why not cancel the DREQ in __cleanup_cep if you're in the DREQ_SENT state when
>destroying? Any MAD that is sent via __cep_send_retry (any MAD that is retried
>until the CEP manager cancels it) sets p_cep->p_send_mad. Use that to cancel
>in __cleanup_cep. I think this will give you the behavior you want: the DREQ
>process gets aborted if the CEP is destroyed by the app.
The __cleanup_cep() call is what sends the DREQ in the first place... The CEP
enters the function in the established state.
I was looking at changing __dreq_cep() to use __cep_send_mad(), which doesn't
take a reference on the CEP.
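
A rough sketch of the difference I have in mind, again with made-up names (sketch_cep_t, sketch_send_retry, sketch_send_once) rather than the real functions. It assumes the retried send path holds a reference until the MAD completes or is canceled, and that the one-shot path holds none. Under those assumptions, a destroy that sends the DREQ once does not wait on the CM timeout.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical CEP model; field and function names are illustrative only. */
typedef struct sketch_cep {
    atomic_int ref_cnt; /* pre-charged to 1 at creation */
    bool destroyed;     /* set where the destroy callback would run */
} sketch_cep_t;

/* Drop one reference and "destroy" when the count hits zero. */
static void sketch_release(sketch_cep_t *cep)
{
    int prev = atomic_fetch_sub(&cep->ref_cnt, 1);
    assert(prev > 0);
    if (prev == 1)
        cep->destroyed = true;
}

/* Assumed retried send: pins the CEP until completion or cancel. */
static void sketch_send_retry(sketch_cep_t *cep)
{
    atomic_fetch_add(&cep->ref_cnt, 1);
}

/* Assumed one-shot send: the MAD goes out without referencing the CEP. */
static void sketch_send_once(sketch_cep_t *cep)
{
    (void)cep;
}

/* Returns true if the CEP is destroyed as soon as cleanup drops its reference. */
static bool sketch_destroy_is_immediate(bool use_retry)
{
    sketch_cep_t cep;
    atomic_init(&cep.ref_cnt, 1);
    cep.destroyed = false;

    if (use_retry)
        sketch_send_retry(&cep); /* DREQ tied to the CEP */
    else
        sketch_send_once(&cep);  /* DREQ sent once, untied */

    sketch_release(&cep);        /* cleanup drops the pre-charged reference */
    return cep.destroyed;
}
```

The trade-off is that a one-shot DREQ is not retried, so the remote side may have to notice the disconnect some other way if the MAD is lost.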
- Sean