[ofw] IBAL CEP reference counting is... interesting

Fri Jan 9 14:28:39 PST 2009

> While trying to track down a ctrl-c hang, I noticed that the CM cep
> code uses a ref_cnt that is allowed to drop to 0 while references are
> still held on the cep (connection endpoint).

From the code:
/* Number of outstanding MADs.  Delays destruction of CEP destruction. */
atomic32_t                                      ref_cnt;

So it's not a ref count to manage user references.  All user calls perform a lookup based on input CID.  The MADs reference the CEP based on the context set in the send.

> The problem is most easily seen in process_timewait(), where the
> ref_cnt of the cep must be 0 before being processed.  Meaning that the
> cep structure is expected to be on a timewait list with a ref_cnt of 0.

Yes, a CEP must have no outstanding MADs before it can be processed for timewait.

> Also, at the end of the loop in process_timewait(), if the cep state
> is not cep_state_destroy, its state is set to idle and left dangling.
> (Either there's no reference on the cep, or whatever has a reference
> on it has not incremented the ref_cnt.)

The validity of the CID and AL handle provided to the al_cep_xxx calls is what determines if a CEP is valid or not.  The ref count is used internally to handle destruction while MADs are outstanding.
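
To make the distinction concrete, here is a minimal sketch of the idea, not the actual IBAL code. Every name in it (sketch_cep_t, sketch_lookup, sketch_destroy, and so on) is hypothetical and stands in for the real CEP manager structures. It only shows that validity is decided by the CID lookup, while ref_cnt counts outstanding MADs and delays the final free.

```c
#include <stddef.h>

#define SKETCH_MAX_CEPS 8   /* hypothetical table size */

/* Simplified stand-in for a CEP; only the fields discussed here. */
typedef struct sketch_cep {
    int cid_valid;   /* the CID still resolves in user-facing lookups */
    int ref_cnt;     /* outstanding MADs only, not user references */
    int destroying;  /* destroy requested, final free may be pending */
    int freed;       /* set once the final free has happened */
} sketch_cep_t;

sketch_cep_t sketch_table[SKETCH_MAX_CEPS];
int sketch_next_cid = 0;

/* Allocate a CEP and return its CID, or -1 if the table is full. */
int sketch_create(void)
{
    if (sketch_next_cid >= SKETCH_MAX_CEPS)
        return -1;
    sketch_cep_t *cep = &sketch_table[sketch_next_cid];
    cep->cid_valid = 1;
    cep->ref_cnt = 0;
    cep->destroying = 0;
    cep->freed = 0;
    return sketch_next_cid++;
}

/* Every user call starts here: an invalid CID means "no such CEP". */
sketch_cep_t *sketch_lookup(int cid)
{
    if (cid < 0 || cid >= sketch_next_cid || !sketch_table[cid].cid_valid)
        return NULL;
    return &sketch_table[cid];
}

/* A MAD was posted with this CEP as its send context. */
void sketch_mad_sent(sketch_cep_t *cep)
{
    cep->ref_cnt++;
}

/* A MAD completed; the last completion finishes a pending destroy. */
void sketch_mad_done(sketch_cep_t *cep)
{
    if (--cep->ref_cnt == 0 && cep->destroying)
        cep->freed = 1;
}

/* Invalidate the CID at once; free now only if no MADs are in flight. */
void sketch_destroy(int cid)
{
    sketch_cep_t *cep = sketch_lookup(cid);
    if (cep == NULL)
        return;
    cep->cid_valid = 0;
    cep->destroying = 1;
    if (cep->ref_cnt == 0)
        cep->freed = 1;
}
```

In this model a CEP sitting idle with ref_cnt == 0 is still perfectly valid as long as its CID resolves, which is the situation on the timewait list.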

> I'm not sure how to fix this.

I'm not sure what the problem is.

> As for the ctrl-c hang, I tracked that problem down to destroying the
> cep in the established state.  (I have the remote endpoint of the
> connection blocked from responding.)  The issue is that the cep being
> destroyed sends a DREQ, taking a reference on the cep.  The reference
> is not released until the DREQ has been retried and completely times
> out, resulting in blocking the upper level code waiting for the
> destroy callback.
>
> Destroying a cep needs to be limited to some sort of reasonable time,
> rather than on the order of seconds or minutes, depending on the
> remote CM response.

Agreed.

> For large clusters, the CM timeout can be huge.  My idea to fix this
> was to have the DREQ sent once without being tied to the cep if
> initiated from the destroy call.  Comments?

Why not cancel the DREQ in __cleanup_cep if you're in the DREQ_SENT state when destroying?  Any MAD that is sent via __cep_send_retry (any MAD that is retried until the CEP manager cancels it) sets p_cep->p_send_mad.  Use that to cancel in __cleanup_cep.  I think this will give you the behavior you want: the DREQ process gets aborted if the CEP is destroyed by the app.
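
Roughly what I have in mind, as a self-contained sketch rather than a patch. The sk_* types and functions are hypothetical stand-ins, not the real IBAL structures or MAD service calls, and the cancel here completes synchronously, which is an assumption made to keep the example short. In the real code the cancelled send would presumably complete through the normal send-completion path.

```c
#include <stddef.h>

typedef enum { SK_IDLE, SK_ESTABLISHED, SK_DREQ_SENT, SK_DESTROY } sk_state_t;

/* Hypothetical MAD: only records whether it was cancelled. */
typedef struct sk_mad {
    int cancelled;
} sk_mad_t;

/* Hypothetical CEP with just the fields the suggestion touches. */
typedef struct sk_cep {
    sk_state_t state;
    int ref_cnt;            /* outstanding MADs */
    sk_mad_t *p_send_mad;   /* MAD currently being retried, or NULL */
    int destroy_cb_called;  /* stands in for the destroy callback */
} sk_cep_t;

/* Stand-in for a retried send: remember the MAD and take a reference. */
void sk_send_retry(sk_cep_t *cep, sk_mad_t *mad)
{
    cep->p_send_mad = mad;
    cep->ref_cnt++;
}

/* Send completion (success, timeout, or cancel): drop the reference. */
void sk_send_done(sk_cep_t *cep, sk_mad_t *mad)
{
    if (cep->p_send_mad == mad)
        cep->p_send_mad = NULL;
    if (--cep->ref_cnt == 0 && cep->state == SK_DESTROY)
        cep->destroy_cb_called = 1;
}

/* Stand-in for cancelling a MAD; completes immediately in this sketch. */
void sk_cancel_mad(sk_cep_t *cep, sk_mad_t *mad)
{
    mad->cancelled = 1;
    sk_send_done(cep, mad);
}

/* Cleanup on destroy: abort a pending DREQ instead of waiting it out. */
void sk_cleanup_cep(sk_cep_t *cep)
{
    sk_state_t old_state = cep->state;

    cep->state = SK_DESTROY;
    if (old_state == SK_DREQ_SENT && cep->p_send_mad != NULL)
        sk_cancel_mad(cep, cep->p_send_mad);
    else if (cep->ref_cnt == 0)
        cep->destroy_cb_called = 1;
    /* Otherwise a MAD is still in flight and sk_send_done finishes up. */
}
```

With the remote side not responding, the destroy callback then no longer waits for the full DREQ retry and timeout sequence.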

-Fab