[openib-general] Re: [PATCH] Opensm - handling immediate error in vendor_send

Hal Rosenstock halr at voltaire.com
Mon Oct 10 07:03:46 PDT 2005


Hi Yael,

On Sun, 2005-10-09 at 07:18, Yael Kalka wrote:
> During our tests on Windows we encountered an issue that is caused due
> to some problem in the lower layer, but causes problem in the opensm.
> If the osm_vendor_send call fails immediatly, we need to update
> several counters (currently, only qp0_mads_sent is decremented), and
> also all the dispatcher, if we reached qp0_mads_outstanding == 0 (in
> order to signal the state mgr).
> What we saw was that these counters weren't decremented, and thus the
> state mgr wasn't signalled, and the opensm didn't proceed in
> traversing through its stages.
> The following patch updates the relevant counters, and calls the
> dispatcher, if neccessary.

Is there a similar issue with QP1 as well ?

Also, in general, atomic_inc and atomic_dec deal with int32 quantities.
There is potential danger if they wrap from positive to negative or visa
versa. I don't think there is any code which deals with this.

I have some comments and questions on this patch embedded below.

-- Hal

> 
> Thanks,
> Yael
> 
> Signed-off-by:  Yael Kalka <yael at mellanox.co.il>

> Index: opensm/osm_vl15intf.c
> ===================================================================
> --- opensm/osm_vl15intf.c	(revision 3703)
> +++ opensm/osm_vl15intf.c	(working copy)
> @@ -157,6 +157,8 @@ __osm_vl15_poller(
>  
>        if( status != IB_SUCCESS )
>        {
> +        uint32_t outstanding;
> +        cl_status_t cl_status;
>          osm_log( p_vl->p_log, OSM_LOG_ERROR,
>                   "__osm_vl15_poller: ERR 3E03: "
>                   "MAD send failed (%s).\n",
> @@ -166,7 +168,64 @@ __osm_vl15_poller(
>            The MAD was never successfully sent, so
>            fix up the pre-incremented count values.
>          */
> +        /* Decrement qp0_mads_sent and qp0_mads_outstanding_on_wire
> +           that was incremented in the code above. */
>          mads_sent = cl_atomic_dec( &p_vl->p_stats->qp0_mads_sent );
> +        if( p_madw->resp_expected == TRUE ) 
> +          if ( !&p_vl->p_stats->qp0_mads_outstanding_on_wire ) 

Should this be !&p_vl->p_stats->qp0_mads_outstanding_on_wire or just
!p_vl->p_stats->qp0_mads_outstanding_on_wire ? If it is the latter,
should there be locking around it like:
  CL_PLOCK_ACQUIRE( p_ctrl->p_lock );
  outstanding = p_ctrl->p_stats->qp0_mads_outstanding;
  CL_PLOCK_RELEASE( p_ctrl->p_lock );

Also, this appears to be debug code (not in other places) ? Why is it
needed here ?

> +            osm_log( p_vl->p_log, OSM_LOG_ERROR, 
> +                     "__osm_vl15_poller: ERR 3E04: " 
> +                     "Trying to dec qp0_mads_outstanding_on_wire=0. " 
> +                     "Problem with transaction mgr!\n");

In this case, outstanding is not initialized so what is supposed to
occur below when outstanding is checked against 0. (Should it be
initialized to 0 ? Do extra signals to the state manager (for
NO_PENDING_TRANSACTIONS) cause the wrong thing to occur ?).
 
> +          else 
> +            cl_atomic_dec( &p_vl->p_stats->qp0_mads_outstanding_on_wire ); 
> +
> +        /* The following code is similar to the one in 
> +           __osm_sm_mad_ctrl_retire_trans_mad. We need to decrement the 
> +           qp0_mads_outstanding counter, and if we reached 0 - need to call
> +           the cl_disp_post with OSM_SIGNAL_NO_PENDING_TRANSACTION (in order
> +           to wake up the state mgr). */
> +        if ( !&p_vl->p_stats->qp0_mads_outstanding )
> +          osm_log( p_vl->p_log, OSM_LOG_ERROR,
> +                    "__osm_vl15_poller: ERR 3E05: "
> +                   "Trying to dec qp0_mads_outstanding=0. "
> +                   "Problem with transaction mgr!\n");
> +        else
> +          outstanding = cl_atomic_dec( &p_vl->p_stats->qp0_mads_outstanding );
> +        
> +        osm_log( p_vl->p_log, OSM_LOG_DEBUG,
> +                 "__osm_vl15_poller: "
> +                 "%u(%u) QP0 MADs outstanding.\n",
> +                 p_vl->p_stats->qp0_mads_outstanding,outstanding );

Should the following preceed this DEBUG call to osm_log:
if( osm_log_is_active( p_vl->p_log, OSM_LOG_DEBUG ) )

> +        if( outstanding == 0 )
> +        {
> +          /*
> +            The wire is clean.
> +            Signal the state manager.
> +          */
> +          if( osm_log_is_active( p_vl->p_log, OSM_LOG_DEBUG ) )
> +          {
> +            osm_log( p_vl->p_log, OSM_LOG_DEBUG,
> +                     "__osm_vl15_poller: "
> +                     "Posting Dispatcher message %s.\n",
> +                     osm_get_disp_msg_str( OSM_MSG_NO_SMPS_OUTSTANDING ) );
> +          }
> +          
> +          cl_status = cl_disp_post( p_vl->h_disp,
> +                                    OSM_MSG_NO_SMPS_OUTSTANDING,
> +                                    (void *)OSM_SIGNAL_NO_PENDING_TRANSACTIONS,
> +                                    NULL,
> +                                    NULL );
> +          
> +          if( cl_status != CL_SUCCESS )
> +          {
> +            osm_log( p_vl->p_log, OSM_LOG_ERROR,
> +                     "__osm_vl15_poller: ERR 3E06: "
> +                     "Dispatcher post message failed (%s).\n",
> +                     CL_STATUS_MSG( cl_status ) );
> +          }
> +        }
>        }
>        else
>        {

Also, the formatting has extra whitespace. (I fixed this by hand).

-- Hal




More information about the general mailing list