[openib-general] Re: [PATCH] Opensm - handling immediate error in vendor_send
Hal Rosenstock
halr at voltaire.com
Mon Oct 10 07:03:46 PDT 2005
Hi Yael,
On Sun, 2005-10-09 at 07:18, Yael Kalka wrote:
> During our tests on Windows we encountered an issue that is caused due
> to some problem in the lower layer, but causes problem in the opensm.
> If the osm_vendor_send call fails immediatly, we need to update
> several counters (currently, only qp0_mads_sent is decremented), and
> also all the dispatcher, if we reached qp0_mads_outstanding == 0 (in
> order to signal the state mgr).
> What we saw was that these counters weren't decremented, and thus the
> state mgr wasn't signalled, and the opensm didn't proceed in
> traversing through its stages.
> The following patch updates the relevant counters, and calls the
> dispatcher, if neccessary.
Is there a similar issue with QP1 as well ?
Also, in general, atomic_inc and atomic_dec deal with int32 quantities.
There is potential danger if they wrap from positive to negative or visa
versa. I don't think there is any code which deals with this.
I have some comments and questions on this patch embedded below.
-- Hal
>
> Thanks,
> Yael
>
> Signed-off-by: Yael Kalka <yael at mellanox.co.il>
> Index: opensm/osm_vl15intf.c
> ===================================================================
> --- opensm/osm_vl15intf.c (revision 3703)
> +++ opensm/osm_vl15intf.c (working copy)
> @@ -157,6 +157,8 @@ __osm_vl15_poller(
>
> if( status != IB_SUCCESS )
> {
> + uint32_t outstanding;
> + cl_status_t cl_status;
> osm_log( p_vl->p_log, OSM_LOG_ERROR,
> "__osm_vl15_poller: ERR 3E03: "
> "MAD send failed (%s).\n",
> @@ -166,7 +168,64 @@ __osm_vl15_poller(
> The MAD was never successfully sent, so
> fix up the pre-incremented count values.
> */
> + /* Decrement qp0_mads_sent and qp0_mads_outstanding_on_wire
> + that was incremented in the code above. */
> mads_sent = cl_atomic_dec( &p_vl->p_stats->qp0_mads_sent );
> + if( p_madw->resp_expected == TRUE )
> + if ( !&p_vl->p_stats->qp0_mads_outstanding_on_wire )
Should this be !&p_vl->p_stats->qp0_mads_outstanding_on_wire or just
!p_vl->p_stats->qp0_mads_outstanding_on_wire ? If it is the latter,
should there be locking around it like:
CL_PLOCK_ACQUIRE( p_ctrl->p_lock );
outstanding = p_ctrl->p_stats->qp0_mads_outstanding;
CL_PLOCK_RELEASE( p_ctrl->p_lock );
Also, this appears to be debug code (not in other places) ? Why is it
needed here ?
> + osm_log( p_vl->p_log, OSM_LOG_ERROR,
> + "__osm_vl15_poller: ERR 3E04: "
> + "Trying to dec qp0_mads_outstanding_on_wire=0. "
> + "Problem with transaction mgr!\n");
In this case, outstanding is not initialized so what is supposed to
occur below when outstanding is checked against 0. (Should it be
initialized to 0 ? Do extra signals to the state manager (for
NO_PENDING_TRANSACTIONS) cause the wrong thing to occur ?).
> + else
> + cl_atomic_dec( &p_vl->p_stats->qp0_mads_outstanding_on_wire );
> +
> + /* The following code is similar to the one in
> + __osm_sm_mad_ctrl_retire_trans_mad. We need to decrement the
> + qp0_mads_outstanding counter, and if we reached 0 - need to call
> + the cl_disp_post with OSM_SIGNAL_NO_PENDING_TRANSACTION (in order
> + to wake up the state mgr). */
> + if ( !&p_vl->p_stats->qp0_mads_outstanding )
> + osm_log( p_vl->p_log, OSM_LOG_ERROR,
> + "__osm_vl15_poller: ERR 3E05: "
> + "Trying to dec qp0_mads_outstanding=0. "
> + "Problem with transaction mgr!\n");
> + else
> + outstanding = cl_atomic_dec( &p_vl->p_stats->qp0_mads_outstanding );
> +
> + osm_log( p_vl->p_log, OSM_LOG_DEBUG,
> + "__osm_vl15_poller: "
> + "%u(%u) QP0 MADs outstanding.\n",
> + p_vl->p_stats->qp0_mads_outstanding,outstanding );
Should the following preceed this DEBUG call to osm_log:
if( osm_log_is_active( p_vl->p_log, OSM_LOG_DEBUG ) )
> + if( outstanding == 0 )
> + {
> + /*
> + The wire is clean.
> + Signal the state manager.
> + */
> + if( osm_log_is_active( p_vl->p_log, OSM_LOG_DEBUG ) )
> + {
> + osm_log( p_vl->p_log, OSM_LOG_DEBUG,
> + "__osm_vl15_poller: "
> + "Posting Dispatcher message %s.\n",
> + osm_get_disp_msg_str( OSM_MSG_NO_SMPS_OUTSTANDING ) );
> + }
> +
> + cl_status = cl_disp_post( p_vl->h_disp,
> + OSM_MSG_NO_SMPS_OUTSTANDING,
> + (void *)OSM_SIGNAL_NO_PENDING_TRANSACTIONS,
> + NULL,
> + NULL );
> +
> + if( cl_status != CL_SUCCESS )
> + {
> + osm_log( p_vl->p_log, OSM_LOG_ERROR,
> + "__osm_vl15_poller: ERR 3E06: "
> + "Dispatcher post message failed (%s).\n",
> + CL_STATUS_MSG( cl_status ) );
> + }
> + }
> }
> else
> {
Also, the formatting has extra whitespace. (I fixed this by hand).
-- Hal
More information about the general
mailing list