[openib-general] [PATCH] Opensm - race in opensm signalling

Yael Kalka yael at mellanox.co.il
Mon Oct 31 05:49:59 PST 2005


Hi Hal,

During our Windows testing we've encountered a case where for some
reason the opensm changes the state of its port to down, and then
brings it back up.
After debugging it, we found out that the reason for that is a
possible race when signaling "OSM_SIGNAL_NO_PENDING_TRANSACTIONS" to
the osm_state_mgr_process.
The qp0_mads_outstanding is decremented, and only later is checked if
reaches zero. So if 2 threads decrement the qp0_mads_outstanding, and
they are running simultanously, they can both signal
OSM_SIGNAL_NO_PENDING_TRANSACTIONS!
This, of course, results in a big mess in the osm_state_mgr_process
flow.
The following patch fixes this issue.

Thanks,
Yael

Signed-off-by:  Yael Kalka <yael at mellanox.co.il>

Index: opensm/osm_vl15intf.c
===================================================================
--- opensm/osm_vl15intf.c	(revision 3915)
+++ opensm/osm_vl15intf.c	(working copy)
@@ -183,28 +183,13 @@ __osm_vl15_poller(
            the cl_disp_post with OSM_SIGNAL_NO_PENDING_TRANSACTION (in order
            to wake up the state mgr).
         */
-        cl_atomic_dec( &p_vl->p_stats->qp0_mads_outstanding );
+        outstanding = cl_atomic_dec( &p_vl->p_stats->qp0_mads_outstanding );
         
         osm_log( p_vl->p_log, OSM_LOG_DEBUG,
                  "__osm_vl15_poller: "
                  "%u QP0 MADs outstanding.\n",
                  p_vl->p_stats->qp0_mads_outstanding );
         
-        /*
-          Acquire the lock non-exclusively.
-          Other modules that send MADs grab this lock exclusively.
-          These modules that are in the process of sending MADs
-          will hold the lock until they finish posting all the MADs
-          they plan to send.  While the other module is sending MADs
-          the outstanding count may temporarily go to zero.
-          Thus, by grabbing the lock ourselves, we get an accurate
-          view of whether or not the number of outstanding MADs is
-          really zero.
-        */
-        CL_PLOCK_ACQUIRE( p_vl->p_lock );
-        outstanding = p_vl->p_stats->qp0_mads_outstanding;
-        CL_PLOCK_RELEASE( p_vl->p_lock );
-
         if( outstanding == 0 )
         {
           /*
Index: opensm/osm_sm_mad_ctrl.c
===================================================================
--- opensm/osm_sm_mad_ctrl.c	(revision 3915)
+++ opensm/osm_sm_mad_ctrl.c	(working copy)
@@ -99,7 +99,7 @@ __osm_sm_mad_ctrl_retire_trans_mad(
 
   osm_mad_pool_put( p_ctrl->p_mad_pool, p_madw );
 
-  cl_atomic_dec( &p_ctrl->p_stats->qp0_mads_outstanding );
+  outstanding = cl_atomic_dec( &p_ctrl->p_stats->qp0_mads_outstanding );
 
   if( osm_log_is_active( p_ctrl->p_log, OSM_LOG_DEBUG ) )
   {
@@ -109,21 +109,6 @@ __osm_sm_mad_ctrl_retire_trans_mad(
              p_ctrl->p_stats->qp0_mads_outstanding );
   }
 
-  /*
-    Acquire the lock non-exclusively.
-    Other modules that send MADs grab this lock exclusively.
-    These modules that are in the process of sending MADs
-    will hold the lock until they finish posting all the MADs
-    they plan to send.  While the other module is sending MADs
-    the outstanding count may temporarily go to zero.
-    Thus, by grabbing the lock ourselves, we get an accurate
-    view of whether or not the number of outstanding MADs is
-    really zero.
-  */
-  CL_PLOCK_ACQUIRE( p_ctrl->p_lock );
-  outstanding = p_ctrl->p_stats->qp0_mads_outstanding;
-  CL_PLOCK_RELEASE( p_ctrl->p_lock );
-
   if( outstanding == 0 )
   {
     /*




More information about the general mailing list