[Openib-windows] Possible race In IPOIB on memfree cards.

Tzachi Dar tzachid at mellanox.co.il
Sun Apr 30 15:23:19 PDT 2006


(* I am writing about memfree cards; however, this race probably exists
on cards with memory as well. Due to some other bug, one notices it only
on memfree cards.)
 
Hi Fab,
 
While debugging IPOIB flows that stop and start opensm, we came to the
conclusion that there is a window in time in which recv packets
shouldn't be posted, yet they still are.
 
Before I start the explanation I want to make sure that we all
understand the different steps of destroying a QP (I guess we all know
this, but just in case).
In the first step the QP is in RTS: one can post recvs and get good
completions.
When one wants to "close" the QP, one has to move the QP to the error
state. All posted recvs will be flushed, and newly posted recvs will
eventually be flushed as well (there is no way to know exactly when, but
probably within a microsecond or so).
Later the QP is brought to reset; at this stage, posted recv packets are
lost.
Next the QP is in the init stage again.
There is a point people might forget: once a QP is in the error state,
completions are still being polled from the HCA, but if one changes the
state too fast, one cannot tell whether all of them have arrived.
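The lifecycle above can be condensed into a tiny state model. This is
only an illustrative sketch of the description in this mail (not the
full IB specification behavior); the type and function names are
hypothetical:

```c
#include <assert.h>

/* Simplified QP states, following the sequence described above. */
typedef enum { QPS_RESET, QPS_INIT, QPS_RTS, QPS_ERROR } qp_state_t;

/* Fate of a recv posted while the QP is in a given state, per the
 * description above: in RTS it completes normally; in ERROR it is
 * eventually flushed (exact timing unknown); in RESET no completion
 * ever arrives, so the recv is lost. */
typedef enum { RECV_COMPLETES, RECV_FLUSHED, RECV_LOST } recv_fate_t;

static recv_fate_t recv_fate(qp_state_t s)
{
    switch (s) {
    case QPS_RTS:   return RECV_COMPLETES;
    case QPS_ERROR: return RECV_FLUSHED;
    default:        return RECV_LOST;  /* RESET (and INIT, in this model) */
    }
}
```

The key point the sketch captures is that only RTS guarantees a normal
completion; everything else either flushes at an unknown time or drops
the recv silently.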
 
In any case, looking at the IPOIB code, there are two important fields:
1) recv_mgr->depth and 2) port->state. Once the port goes down, the
state changes to error, and in the port-up function there is a busy loop
that waits for all packets to complete. In the function post_recv (so
far so good) the state field is checked (but not atomically).
The problem starts when you change port->state to RTS while the QP is
still in the ERROR and, later, RESET stages. There is a window in time
during which you can post recvs but cannot be sure to get completions
for them.
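The window reduces to a check-then-act pattern: post_recv consults the
software field, while the hardware QP state is what actually determines
whether a completion will arrive. A minimal sketch (illustrative names,
not the actual driver symbols):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified states; names mirror the IB QP states discussed above. */
typedef enum { QPS_RESET, QPS_RTS, QPS_ERROR } qp_state_t;

/* Hypothetical port: the software flag and the hardware QP can disagree. */
typedef struct {
    qp_state_t port_state;  /* what post_recv checks (port->state)  */
    qp_state_t qp_state;    /* state the hardware QP is actually in */
} port_t;

/* post_recv's check: only the software field is consulted. */
static bool post_recv_allowed(const port_t *p)
{
    return p->port_state != QPS_ERROR;
}

/* A posted recv is only guaranteed a completion once the QP is in RTS. */
static bool completion_guaranteed(const port_t *p)
{
    return p->qp_state == QPS_RTS;
}
```

The race window is exactly the interval where port_state has already
been set to RTS but the hardware QP is still in ERROR or RESET:
post_recv_allowed() returns true while completion_guaranteed() returns
false, so recvs posted then may never complete.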
 
I have added some more debug prints and asserts that show this bug. You
can reproduce it using the test at the end of this mail. My patch also
contains a simple workaround for the problem (the workaround makes the
window smaller; it doesn't eliminate it altogether). To see only the
workaround, look for the field DontPostRecvs in the code (only 4 lines
or so).
 
What do you think?
 
Thanks
Tzachi
 
By the way, there are probably a few ways to fix this bug, and as the
maintainer you will have to decide on the best one; however, here is an
idea I have for solving the bug without introducing new fields or taking
an extra lock.
 
Currently the depth field holds the number of packets that are currently
posted and is always a non-negative number. My change: while the QP is
active, use the same algorithm. When the QP is about to be destroyed,
atomically subtract 1,000,000 from the depth field. You are sure to
reach a negative number, which marks that the QP is in the closing
process. Please note that if the number you get is bigger than
-1,000,000, there are posted recvs that have not yet been polled, and
you have to keep polling until depth reaches -1,000,000. One more change
you should make: before posting, atomically add the number of recvs
about to be posted. If after the add you get a negative number, don't
post the recvs (the QP has already started to close) and subtract the
number just added.
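The idea above can be sketched as follows. This is only an illustration
of the proposed biased-counter scheme, using C11 atomics rather than the
driver's cl_atomic_* wrappers, and all names here are hypothetical:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define CLOSING_BIAS 1000000L

/* Hypothetical recv manager carrying only the biased depth counter. */
typedef struct {
    atomic_long depth;  /* posted-recv count; goes negative once closing */
} recv_mgr_t;

/* Mark the QP as closing: bias the counter negative in one atomic op. */
static void recv_mgr_begin_close(recv_mgr_t *m)
{
    atomic_fetch_sub(&m->depth, CLOSING_BIAS);
}

/* All outstanding recvs have been polled once depth == -CLOSING_BIAS. */
static bool recv_mgr_close_done(recv_mgr_t *m)
{
    return atomic_load(&m->depth) == -CLOSING_BIAS;
}

/* Reserve room for n recvs before posting.  Returns false (and backs
 * the add out) if the QP has already started to close. */
static bool recv_mgr_try_post(recv_mgr_t *m, long n)
{
    long after = atomic_fetch_add(&m->depth, n) + n;
    if (after < 0) {
        atomic_fetch_sub(&m->depth, n);  /* QP closing: undo and refuse */
        return false;
    }
    return true;  /* safe to call post_recv() */
}

/* Completion path: n recvs were polled from the CQ. */
static void recv_mgr_completed(recv_mgr_t *m, long n)
{
    atomic_fetch_sub(&m->depth, n);
}
```

The single atomic add in recv_mgr_try_post replaces the non-atomic
state check: the post either lands while the counter is still unbiased,
or it observes a negative result and backs itself out, so no recv can
be posted after the close has begun.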
 
If you think this idea is good, I can send you a patch.
 
 
Here is the script that restarts open sm:
 
D:\Documents and Settings\herod>type testsm.cmd
:start
start opensm
q:\unixtools\sleep 3
kill opensm
q:\unixtools\sleep 1
goto start
 
 
The patch that explains the problem:
Index: ipoib_port.c
===================================================================
--- ipoib_port.c (revision 331)
+++ ipoib_port.c (working copy)
@@ -35,7 +35,13 @@
 #include "ipoib_adapter.h"
 #include "ipoib_debug.h"
 
+long g_CurrentlyPostedRecvs = 0;
+long g_CurrentlyPolledEntries = 0;
 
+long g_CurrentlyPostedFromLastResetReset = 0;
+long g_CurrentlyPolledEntriesFromLastReset = 0;
+
+
 /*
 * PR 102492 relates to the SA not accepting wildcard values for MCMemberRecord
 * Sets used to create/join Multicast groups.  Defining this keyword causes all
@@ -540,6 +546,7 @@
  IPOIB_ENTER( IPOIB_DBG_INIT );
 
  p_port->state = IB_QPS_RESET;
+    p_port->DontPostRecvs = TRUE;
 
  cl_obj_construct( &p_port->obj, IPOIB_OBJ_PORT );
  cl_spinlock_construct( &p_port->send_lock );
@@ -643,6 +650,7 @@
 }
 
 
+
 static void
 __port_destroying(
  IN    cl_obj_t* const    p_obj )
@@ -676,7 +684,7 @@
  p_port = PARENT_STRUCT( p_obj, ipoib_port_t, obj );
 
  /* Wait for all sends and receives to get flushed. */
- while( p_port->send_mgr.depth || p_port->recv_mgr.depth )
+ while( p_port->send_mgr.depth || p_port->recv_mgr.depth1 )
   cl_thread_suspend( 0 );
 
  /* Destroy the send and receive managers before closing the CA. */
@@ -746,6 +754,7 @@
  ib_qp_attr_t  qp_attr;
 
  IPOIB_ENTER( IPOIB_DBG_INIT );
+    IPOIB_TRACE_EXIT( IPOIB_DBG_ERROR,("__ib_mgr_init creating everything\n"));
 
  /* Open the CA. */
  status = p_port->p_adapter->p_ifc->open_ca(
@@ -1298,7 +1307,7 @@
  IPOIB_ENTER( IPOIB_DBG_INIT );
 
  CL_ASSERT( cl_is_qlist_empty( &p_port->recv_mgr.done_list ) );
- CL_ASSERT( !p_port->recv_mgr.depth );
+ CL_ASSERT( !p_port->recv_mgr.depth1 );
 
  if( p_port->recv_mgr.recv_pkt_array )
   cl_free( p_port->recv_mgr.recv_pkt_array );
@@ -1324,19 +1333,20 @@
  CL_ASSERT( p_port );
  CL_ASSERT( num_wrs );
  cl_obj_lock( &p_port->obj );
- if( p_port->state == IB_QPS_ERROR )
+ if(( p_port->state == IB_QPS_ERROR ) || (p_port->DontPostRecvs == TRUE))
  {
   cl_obj_unlock( &p_port->obj );
   IPOIB_TRACE_EXIT( IPOIB_DBG_RECV,
    ("Port in invalid state.  Not reposting.\n") );
   return IB_SUCCESS;
  }
+    ASSERT(p_port->state == IB_QPS_RTS);
  cl_obj_ref( &p_port->obj );
  cl_obj_unlock( &p_port->obj );
 
  while( num_wrs-- )
  {
-  if( p_port->recv_mgr.depth == p_port->p_adapter->params.rq_depth )
+  if( p_port->recv_mgr.depth1 == p_port->p_adapter->params.rq_depth )
   {
    IPOIB_TRACE( IPOIB_DBG_INFO | IPOIB_DBG_RECV,
     ("Receive queue already full\n") );
@@ -1366,7 +1376,7 @@
 
   p_head = p_next;
 
-  cl_atomic_inc( &p_port->recv_mgr.depth );
+  cl_atomic_inc( &p_port->recv_mgr.depth1 );
  }
 
  if( !p_head )
@@ -1377,8 +1387,27 @@
  }
 
  cl_perf_start( PostRecv );
+    if( p_port->state == IB_QPS_ERROR ) {
+        IPOIB_TRACE( IPOIB_DBG_ERROR,("****before Posting while in error"));
+    }
+    {
+        // Count the number of objects
+        int i = 0;
+        ib_recv_wr_t *pTemp = &p_head->wr;
+        while (pTemp != NULL) {
+            i++;
+            pTemp = pTemp->p_next;
+        }
+        InterlockedExchangeAdd(&g_CurrentlyPostedRecvs, i);
+        InterlockedExchangeAdd(&g_CurrentlyPostedFromLastResetReset, i);
+
+    }
  status = p_port->p_adapter->p_ifc->post_recv(
   p_port->ib_mgr.h_qp, &p_head->wr, &p_failed );
+    if( p_port->state == IB_QPS_ERROR ) {
+        IPOIB_TRACE( IPOIB_DBG_ERROR,("****After Posting while in error"));
+    }
+
  cl_perf_stop( &p_port->p_adapter->perf, PostRecv );
 
  /*
@@ -1397,7 +1426,8 @@
    p_failed = p_failed->p_next;
 
    __buf_mgr_put_recv( p_port, p_head, NULL );
-   cl_atomic_dec( &p_port->recv_mgr.depth );
+   cl_atomic_dec( &p_port->recv_mgr.depth1 );
+            CL_ASSERT(FALSE);
   }
  }
 
@@ -1495,7 +1525,7 @@
 {
  ipoib_port_t  *p_port;
  ib_api_status_t  status;
- ib_wc_t    wc[MAX_RECV_WC], *p_free, *p_wc;
+ ib_wc_t    wc[MAX_RECV_WC * 2], *p_free, *p_wc;
  uint32_t   pkt_cnt, recv_cnt = 0;
  cl_qlist_t   done_list, bad_list;
  size_t    i;
@@ -1524,9 +1554,9 @@
 
  cl_obj_ref( &p_port->obj );
 
- for( i = 0; i < MAX_RECV_WC; i++ )
+ for( i = 0; i < MAX_RECV_WC * 2; i++ ) //?????????? * 2
   wc[i].p_next = &wc[i + 1];
- wc[MAX_RECV_WC - 1].p_next = NULL;
+ wc[MAX_RECV_WC * 2 - 1].p_next = NULL; //???????? X2
 
  /*
   * We'll be accessing the endpoint map so take a reference
@@ -1558,7 +1588,7 @@
  cl_atomic_dec( &p_port->endpt_rdr );
 
  /* Update our posted depth. */
- cl_atomic_sub( &p_port->recv_mgr.depth, recv_cnt );
+ cl_atomic_sub( &p_port->recv_mgr.depth1, recv_cnt );
 
  cl_perf_start( BuildPktArray );
  /* Notify NDIS of any and all possible receive buffers. */
@@ -1758,6 +1788,11 @@
 
  IPOIB_ENTER( IPOIB_DBG_RECV );
 
+    for( p_wc = p_done_wc_list; p_wc; p_wc = p_wc->p_next ) {
+        InterlockedIncrement(&g_CurrentlyPolledEntries);
+        InterlockedIncrement(&g_CurrentlyPolledEntriesFromLastReset);
+    }
+
  for( p_wc = p_done_wc_list; p_wc; p_wc = p_wc->p_next )
  {
   CL_ASSERT( p_wc->wc_type == IB_WC_RECV );
@@ -1772,6 +1807,7 @@
      ("Failed completion %s\n",
      p_port->p_adapter->p_ifc->get_wc_status_str( p_wc->status )) );
     ipoib_inc_recv_stat( p_port->p_adapter, IP_STAT_ERROR, 0 );
+                ASSERT(FALSE);
    }
    else
    {
@@ -1780,7 +1816,9 @@
      p_port->p_adapter->p_ifc->get_wc_status_str( p_wc->status )) );
     ipoib_inc_recv_stat( p_port->p_adapter, IP_STAT_DROPPED, 0 );
    }
-   cl_qlist_insert_tail( p_bad_list, &p_desc->item.list_item );
+//            if (p_bad_list->count < 0x80) {
+       cl_qlist_insert_tail( p_bad_list, &p_desc->item.list_item );
+//            }
    /* Dereference the port object on behalf of the failed receive. */
    cl_obj_deref( &p_port->obj );
    continue;
@@ -4419,10 +4457,23 @@
  IPOIB_ENTER( IPOIB_DBG_INIT );
 
  /* Wait for all receives to get flushed. */
- while( p_port->recv_mgr.depth )
+ while( p_port->recv_mgr.depth1 )
   cl_thread_suspend( 0 );
 
+    g_CurrentlyPolledEntriesFromLastReset = 0;
+    g_CurrentlyPostedFromLastResetReset = 0;
+/*
+    __ib_mgr_destroy(p_port);
+
+ __recv_mgr_destroy( p_port );
+// __send_mgr_destroy( p_port );
+ __buf_mgr_destroy( p_port );
+
+
+    __port_init_short(p_port);
+*/
  p_port->state = IB_QPS_RTS;
+    ASSERT(p_port->DontPostRecvs == TRUE);
 
  info.method = IB_MAD_METHOD_GET;
  info.attr_id = IB_MAD_ATTR_PORTINFO_RECORD;
@@ -4736,6 +4787,9 @@
  ib_api_status_t  status;
  ib_mcast_req_t  mcast_req;
 
+
+    IPOIB_TRACE( IPOIB_DBG_ERROR,("__port_join_bcast called \n"));
+
  IPOIB_ENTER( IPOIB_DBG_MCAST );
 
  /* Check that the rate is realizable for our port. */
@@ -4802,6 +4856,8 @@
  ib_api_status_t  status;
  ib_mcast_req_t  mcast_req;
 
+    IPOIB_TRACE( IPOIB_DBG_ERROR,("__port_create_bcast called \n"));
+
  IPOIB_ENTER( IPOIB_DBG_MCAST );
 
  /* Join the broadcast group. */
@@ -4873,6 +4929,7 @@
  /* Mark our state.  This causes all callbacks to abort. */
  cl_obj_lock( &p_port->obj );
  p_port->state = IB_QPS_ERROR;
+    p_port->DontPostRecvs = TRUE;
 
  NdisWriteErrorLogEntry( p_port->p_adapter->h_adapter,
   EVENT_IPOIB_PORT_DOWN, 0 );
@@ -4891,6 +4948,7 @@
  CL_ASSERT( p_port->ib_mgr.h_qp );
  cl_memclr( &qp_mod, sizeof(ib_qp_mod_t) );
  qp_mod.req_state = IB_QPS_ERROR;
+    IPOIB_TRACE( IPOIB_DBG_ERROR,("Before changing to error, currently posted %d\n", p_port->recv_mgr.depth1));
  status = p_port->p_adapter->p_ifc->modify_qp( p_port->ib_mgr.h_qp, &qp_mod );
  if( status != IB_SUCCESS )
  {
@@ -4900,6 +4958,7 @@
   p_port->p_adapter->hung = TRUE;
   return;
  }
+    IPOIB_TRACE( IPOIB_DBG_ERROR,("After changing to error, currently posted %d\n", p_port->recv_mgr.depth1));
 
  /* Reset all endpoints so we don't flush our ARP cache. */
  __endpt_mgr_reset_all( p_port );
@@ -4981,6 +5040,8 @@
   return;
  }
 
+
+
  /* Get the QP ready for action. */
  status = __ib_mgr_activate( p_port );
  if( status != IB_SUCCESS )
@@ -4996,6 +5057,8 @@
  }
 
  /* Notify the adapter that we now have an active connection. */
+
+    
  ipoib_set_active( p_port->p_adapter );
 
  cl_obj_deref( &p_port->obj );
@@ -5039,6 +5102,11 @@
   * Move the QP to RESET.  This allows us to reclaim any
   * unflushed receives.
   */
+
+    ASSERT(g_CurrentlyPostedFromLastResetReset == 0);
+    ASSERT(g_CurrentlyPolledEntriesFromLastReset == 0);
+    ASSERT( p_port->recv_mgr.depth1 == 0 );
+   
  cl_memclr( &qp_mod, sizeof(ib_qp_mod_t) );
  qp_mod.req_state = IB_QPS_RESET;
  status = p_port->p_adapter->p_ifc->modify_qp( p_port->ib_mgr.h_qp, &qp_mod );
@@ -5083,6 +5151,7 @@
 
  /* Prepost receives. */
  cl_spinlock_acquire( &p_port->recv_lock );
+    p_port->DontPostRecvs = FALSE;
  __recv_mgr_repost( p_port, p_port->p_adapter->params.rq_depth );
  cl_spinlock_release( &p_port->recv_lock );
 
Index: ipoib_port.h
===================================================================
--- ipoib_port.h (revision 331)
+++ ipoib_port.h (working copy)
@@ -406,7 +406,7 @@
 
 typedef struct _ipoib_recv_mgr
 {
- atomic32_t  depth;
+ atomic32_t  depth1;
 
  NDIS_PACKET  **recv_pkt_array;
 
@@ -475,6 +475,7 @@
  cl_obj_rel_t   rel;
 
  ib_qp_state_t   state;
+    BOOLEAN                    DontPostRecvs;
 
  cl_spinlock_t   recv_lock;
  cl_spinlock_t   send_lock;
