[Openib-windows] races on __destrot_obj function

Thu Jul 6 07:45:29 PDT 2006

Hi Yossi,

On 7/6/06, Yossi Leybovich <sleybo at mellanox.co.il> wrote:
>
> Fab
>
> We have ran test that move openSM up/down for several minutes.
> I got ASSERT in the function __destory_cb :
> ".....
> static void
> __destroy_cb(
>  IN    cl_async_proc_item_t  *p_item )
> {
>  cl_obj_t    *p_obj, *p_last_parent;
>
>  CL_ASSERT( p_item );
>
>  p_obj = PARENT_STRUCT( p_item, cl_obj_t, async_item );
>  CL_ASSERT( !p_obj->ref_cnt );
>
> ...."
>
> I tried to track which reference was not deref and found that one of the
> mcast  join query still hold reference.
> If we look at the code at __destrot_obj we can see that the code wait 10 sec
> ,then assume that the reference went down and move to destory_cb
>
>
>  if( destroy_type == CL_DESTROY_SYNC )
>  {
>   if( ref_cnt )
>   {
>    /* Wait for all other references to go away. */
>    cl_event_wait_on( &p_obj->event, 10000000, FALSE );
>   }
>   __destroy_cb( &p_obj->async_item );
>  }
>
> unfortunately in the IPoIB registry configuration we were configured to 10
> retries and timeout of 1000ms (10 sec)
> So we end up going to the destroy_cb while we still hold reference.
> Don't you think that it will be better to put infinite time in
> cl_event_wait_on and wait till all the references will be returned?

The behavior should mimic what is done in the al_obj code - in debug
builds, the code will wait for a limited time, and then assert (so you
can actually trap an erroneous case), while in free builds it waits
indefinitely.

What object was being destroyed?  I assume the port object.   The
cl_obj has a type member, which is defined by the IPoIB in this case:

#define IPOIB_OBJ_INSTANCE		1
#define IPOIB_OBJ_PORT		2
#define IPOIB_OBJ_ENDPOINT		3

We should add a mechanism to allow MC requests to be cancelled.  This
should be fairly straight forward I think.  This would allow the mc
requests to be cancelled, which would reduce the wait time.

> I don't think any value can predict how much time it will take to ref_cnt to
> go to 0.

I agree - the timeout should be infinite, or an infinite loop of
finite values and assertions.

> We better find the problem and wait for ref_cnt to be 0, other then cont.
> and get blue screen.(We ran this test in free version and got blue screen)

Agreed.

> BTW
> Do we have way in our stack to track which client did not return its
> reference ?

There is no code to track the various places references are taken.  I
don't know if there's a good way to find out who the caller was, so
this ends up having to be done manually in each client (i.e. have a
second reference count per location references are taken or something
similar).

> To debug this problem I create array that each index present a call to the
> cl_obj_ref on IPoIB port obj.
> Each call to ref increment the array and each call to deref decrement the
> array in that way I found which client did not return reference.
> (there are cases that one call to ref maps to few calls of deref )
> Do you have another suggestion how to debug this problems? I know leonid
> also chasing lost ref in the IBAL.

Most of the times I've debugged this I've done it through code
walkthroughs and code inspection, looking at all the code paths where
references are taken.

> Do you think we should add the ref tracking code to the trunk ?

I don't see us using the code except when looking for reference
errors.  I don't have much of an objection in adding it in debug
builds, though.

- Fab