[Openib-windows] RE: A bug in ib_cm_listen?

Mon Sep 12 11:11:09 PDT 2005

> From: Tzachi Dar [mailto:tzachid at mellanox.co.il]
> Sent: Monday, September 12, 2005 12:52 AM
> 
> Hi Fab,
> 
> While running some tests with SDP I have reached a problem in the
> destruction of objects.
> 
> I think I know where the problem is, can you please verify this and check
> the fix in?

Fixed and checked in revision 63.

> Here is the problem:
> When calling ib_cm_listen() twice with the same parameters, the second one
> fails (which is what we expect), however there is an assertion in this
> process. The assertion points to the fact that an object is being destroyed
> twice.
> 
> You can reproduce this behavior very easily.
> 
> More information:
> The code in ib_cm_listen() calls __cep_listen(). __cep_listen() calls
> al_cep_listen (line 1962). Since the status represents a failure,
> p_listen->obj.pfn_destroy( &p_listen->obj, NULL ); is being called. From
> what I saw in other places it seems that before this call ref_al_obj(
> &p_listen->obj ); should also be called (this is probably the place that the
> bug is).

The problem is actually that the code uses the reference taken by init_al_obj as
the reference taken on the listen on behalf of the CEP (since the CEP can invoke
an up-call).  If al_listen_cep fails, __destroying_listen will destroy the CEP,
which will release the reference from init_al_obj.  Thus, when
__destroying_listen returns, the reference count will be wrong if called from
__create_cep.  I changed the code to explicitly take a reference for the CEP
after al_create_cep returns success.  This reference is released in
__destroying_cep.
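
To make the accounting above concrete, here is a rough toy model in plain C.
It is not the AL code: every name in it (toy_obj_t, toy_create_cep, and so on)
is made up for illustration, and the "CEP" is just a flag.  It only shows why
sharing the init reference with the CEP releases one reference twice on the
failure path, while an explicit reference for the CEP balances out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a reference-counted object (hypothetical, not al_obj). */
typedef struct toy_obj {
    int  ref_cnt;     /* outstanding references */
    bool has_cep;     /* true while the toy "CEP" exists */
    int  free_count;  /* times the count dropped to zero */
    bool underflow;   /* set if a reference is released that nobody holds */
} toy_obj_t;

static void toy_ref(toy_obj_t *o) { o->ref_cnt++; }

static void toy_deref(toy_obj_t *o)
{
    if (o->ref_cnt == 0) {      /* the double release the assertion caught */
        o->underflow = true;
        return;
    }
    if (--o->ref_cnt == 0)
        o->free_count++;
}

/* Like the init step described above: returns holding one reference. */
static void toy_init(toy_obj_t *o)
{
    assert(o != NULL);
    o->ref_cnt = 1;
    o->has_cep = false;
    o->free_count = 0;
    o->underflow = false;
}

/* Create the toy CEP; optionally take a separate reference on its behalf. */
static void toy_create_cep(toy_obj_t *o, bool explicit_ref)
{
    o->has_cep = true;
    if (explicit_ref)
        toy_ref(o);
}

/* "Destroying" callback: tears down the CEP and drops the CEP's reference. */
static void toy_destroying(toy_obj_t *o)
{
    if (o->has_cep) {
        o->has_cep = false;
        toy_deref(o);
    }
}

/* Destructor: runs the callback, then consumes the caller's reference. */
static void toy_destroy(toy_obj_t *o)
{
    toy_destroying(o);
    toy_deref(o);
}

/* Failure path: init, create the CEP, pretend listen failed, destroy. */
static toy_obj_t toy_failed_listen(bool explicit_ref)
{
    toy_obj_t o;
    toy_init(&o);
    toy_create_cep(&o, explicit_ref);
    toy_destroy(&o);
    return o;
}
```

With explicit_ref false the single init reference is released once by the
callback and once by the destructor, so the underflow flag is set.  With
explicit_ref true the count goes 2, 1, 0 and the object is freed exactly once.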

> Can you also please explain why sometimes the reference is being taken
> before the destroy and sometimes not?

When objects are created, init_al_obj returns with a reference on the object so
that once it is attached, parallel destruction of the parent won't destroy the
child while it is undergoing initialization.

When destroying objects, the caller of the destructor (p_obj->pfn_destroy) is
expected to have a reference on the object.  Because destruction can't always
happen at DISPATCH, callers can't hold a lock while calling the destructor.
Therefore, callers that store children in lists must find the object while
holding a lock, take a reference, release the lock, and invoke the destructor.
The reference taken in the loop prevents the object from being destroyed in
parallel.  This was required to properly destroy QPs and CQs, since CQs track
their bound QPs to create a one (QP) to many (CQ) relationship that the al_obj
abstraction doesn't support.
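
The find, reference, unlock, destroy sequence above looks roughly like the
following sketch.  It is a toy model, not the AL code: the names (parent_t,
child_t, toy_lock, destroy_all_children) are invented here, the "lock" is a
plain flag standing in for a spinlock, and children are assumed to sit at a
reference count of zero once fully created.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct child {
    struct child *next;
    int  ref_cnt;
    bool destroyed;
} child_t;

typedef struct parent {
    bool     lock_held;  /* toy stand-in for a spinlock */
    child_t *children;   /* singly linked list of children */
} parent_t;

static void toy_lock(parent_t *p)   { assert(!p->lock_held); p->lock_held = true; }
static void toy_unlock(parent_t *p) { assert(p->lock_held);  p->lock_held = false; }

/* Destructor: must be entered without the lock, holding one reference. */
static void child_destroy(parent_t *p, child_t *c)
{
    child_t **pp;

    assert(!p->lock_held);
    assert(c->ref_cnt > 0);

    /* Unlink from the parent's list; the lock is held only for this step. */
    toy_lock(p);
    for (pp = &p->children; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == c) {
            *pp = c->next;
            break;
        }
    }
    toy_unlock(p);

    c->next = NULL;
    c->destroyed = true;
    c->ref_cnt--;        /* consume the caller's reference */
}

/* Destroy every child; returns how many were destroyed. */
static int destroy_all_children(parent_t *p)
{
    int n = 0;

    for (;;) {
        child_t *c;

        toy_lock(p);
        c = p->children;
        if (c != NULL)
            c->ref_cnt++;   /* taken while the lock keeps the list stable */
        toy_unlock(p);

        if (c == NULL)
            break;

        child_destroy(p, c); /* called with the lock released */
        n++;
    }
    return n;
}
```

The loop restarts from the list head each time because the list may change
while the lock is dropped around the destructor call.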

Normally, during object creation you will see all calls to destroy *not*
reference the object again, since init_al_obj already did the referencing
implicitly.  Any other destruction requires the caller to take a reference.
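
As a compact illustration of that rule, here is another toy sketch (made-up
names, not the AL code).  It assumes the creator drops the init reference
once creation succeeds, which is my simplification for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct demo_obj {
    int  ref_cnt;
    bool destroyed;
} demo_obj_t;

/* Like the init step described above: returns holding one reference. */
static void demo_init(demo_obj_t *o)
{
    assert(o != NULL);
    o->ref_cnt = 1;
    o->destroyed = false;
}

/* Destructor contract: the caller holds a reference, which is consumed. */
static void demo_destroy(demo_obj_t *o)
{
    assert(o->ref_cnt > 0);
    o->destroyed = true;
    o->ref_cnt--;
}

/* Creation path: on failure, destroy using the reference init returned. */
static bool demo_create(demo_obj_t *o, bool setup_ok)
{
    demo_init(o);
    if (!setup_ok) {
        demo_destroy(o);  /* no extra reference: init already took one */
        return false;
    }
    o->ref_cnt--;         /* assumed: creation done, drop the init reference */
    return true;
}

/* Any later destruction: take a reference first, then destroy. */
static void demo_destroy_later(demo_obj_t *o)
{
    o->ref_cnt++;
    demo_destroy(o);
}
```

Both paths leave the count at zero; the only difference is who supplied the
reference that the destructor consumes.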

Hopefully that made sense - let me know if you have any more questions about
destruction.

All this is going to go away, though, since we're going to move to a model where
destroy will fail if there are any bound objects.

- Fab