[ofw] A problem in the ai_ioc_pnp.c file

Tzachi Dar tzachid at mellanox.co.il
Tue Feb 8 09:40:32 PST 2011


Hello Fab,
Recently, we got the following assert  at __ioc_async_cb:            CL_ASSERT( !gp_ioc_pnp->query_cnt ); (==-1)
We suspect the problem appeared because of:

1.       From __ioc_pnp_timer_cb(
                /* Pre-charge the ref count so that we don't toggle between 0 and 1. */
                cl_atomic_inc( &p_mgr->query_cnt );
                /* Take a reference on the object for the duration of the sweep process. */
                ref_al_obj( &p_mgr->obj );
                for( p_item = cl_qlist_head( &p_mgr->obj.obj_list );
                                p_item != cl_qlist_end( &p_mgr->obj.obj_list );
                                p_item = cl_qlist_next( p_item ) )
                {
                                p_svc = PARENT_STRUCT( PARENT_STRUCT( p_item, al_obj_t, pool_item ),
                                                ioc_pnp_svc_t, obj );
                                cl_atomic_inc( &p_mgr->query_cnt );
                                status = __ioc_query_sa( p_svc );
                                if( status != IB_SUCCESS )
                                                cl_atomic_dec( &p_mgr->query_cnt );
                }
                /* Release the reference we took and see if we're done sweeping. */
                if( !cl_atomic_dec( &p_mgr->query_cnt ) )
                                cl_async_proc_queue( gp_async_pnp_mgr, &p_mgr->async_item )

2.       Here one assumes that either one of 2 situations is true:

·         __ioc_query_sa - returns success and later the reference will be removed by the appropriate callback exactly one time

·         It fails and the reference will be removed immediately
Occasionally, this is not the case

3.       __ioc_query_sa calls twice to ib_query: for Node and Path queries. Thus, 2 callbacks will be invoked thereafter: __node_rec_cb and __path_rec_cb

4.       Each of these callback may (and will) call to __process_query()

5.       If process_query will be called twice, that query_cnt will be decreased twice and we are in trouble.

6.       Theoretically, there is another counter that should avoid this case.

>From __ioc_query_sa():

........

cl_atomic_inc( &p_svc->query_cnt );<---Raise reference to ensure that __process_query will be called only once

status = ib_query( gh_al, &query, &p_svc->h_node_query );

        if( status != IB_SUCCESS )

        {

                        cl_atomic_dec( &p_svc->query_cnt );

                        deref_al_obj( &p_svc->obj );

                        AL_PRINT_EXIT( TRACE_LEVEL_WARNING, AL_DBG_PNP,

                                        ("ib_query returned %s\n", ib_get_err_str( status )) );

                        return status;

        }

..............

        /* Reference the service for the node record query. */

        ref_al_obj( &p_svc->obj );

        cl_atomic_inc( &p_svc->query_cnt );



        status = ib_query( gh_al, &query, &p_svc->h_path_query );

        if( status != IB_SUCCESS )

        {

                        cl_atomic_dec( &p_svc->query_cnt );

                        deref_al_obj( &p_svc->obj );

                        AL_PRINT_EXIT( TRACE_LEVEL_WARNING, AL_DBG_PNP,

                                        ("ib_query returned %s\n", ib_get_err_str( status )) );

        }

>From __path_rec_cb and node_rec_cb:



if( !cl_atomic_dec( &p_svc->query_cnt ) ) //should avoid calling __process_query twice

        {

                        /* The node query has already completed.  Process the results. */

                        __process_query( p_svc );

        }



The issues that we see with the following situation are this:

1)       When __ioc_query_sa is running, the first query will be executed and it's call back will be called before the second query starts. In this case, the __node_rec_cb will decrease the p_svc->query_cnt and bring it to 0. Thus, __process_query will be called twice !

2)       When looking at the function __process_query we expect it to call cl_atomic_dec( &gp_ioc_pnp->query_cnt)exactly one time. However, if the code executes till the end of the function, than we reach a case statement in which in the case of status == success the reference will not be removed.

3)       p_svc->query_cnt seems to be like a reference count that is protecting gp_ioc_pnp->query_cnt. However this only works if p_svc->query_cnt was 0 when __ioc_pnp_timer_cb was called and no one has touched it on the entire time. However, we see that p_svc->query_cnt is being touched in many places in the code, and we don't seem to understand why this assumptions are right.

4)       What is the consequences of calling __process_query more than once (this is an issue that we believe that might currently happen in the code).

We are still trying to understand the assumptions that this code is based on.

Thanks
Tzachi & Alexander (XaleX) Naslednikov

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/ofw/attachments/20110208/0be55865/attachment.html>


More information about the ofw mailing list