<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<HTML><HEAD>

<META http-equiv=Content-Type content="text/html; charset=us-ascii">

<META content="MSHTML 6.00.2900.2802" name=GENERATOR></HEAD>

<BODY>

<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>(* I am writing on 

memfree cards, however this race exists probably on cards with memory. Due to 

some other bug one notices it only on memfree cards).</SPAN></FONT></DIV>

<DIV><FONT face=Arial size=2><SPAN 

class=935384421-30042006></SPAN></FONT> </DIV>

<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>Hi 

Fab,</SPAN></FONT></DIV>

<DIV><FONT face=Arial size=2><SPAN 

class=935384421-30042006></SPAN></FONT> </DIV>

<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>While debugging 

IPOIB flows that use stopping and starting open sm we have came to a conclusion 

that there is a window in time in which revs packets shouldn't be posted, still 

they are.</SPAN></FONT></DIV>

<DIV><FONT face=Arial size=2><SPAN 

class=935384421-30042006></SPAN></FONT> </DIV>

<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>Before I start in 

the explanation I want to make sure that we all understand the different steps 

of destroying a QP (I guess we all know this, but just in 

case).</SPAN></FONT></DIV>

<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>In the first step 

the QP is in RTS, one can post recvs and get good answers.</SPAN></FONT></DIV>

<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>When one wants to 

"close" the QP, he will have to move the QP to error state. All posted recvs 

will be shutdown, and new posted recvs will one day be also flushed. (no way to 

know exactly when, but probably in a Us or so).</SPAN></FONT></DIV>

<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>Later the QP is 

being brought to reset, at this stage, posted recv packets are lost. 

</SPAN></FONT></DIV>

<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>Next the QP is in 

the init stage again.</SPAN></FONT></DIV>

<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>There is a point 

people might forget, that once a QP is in an error stage, the HCA is pooling for 

completions, still if one changes the state too fast, he can't tell if they are 

all there or not.</SPAN></FONT></DIV>

<DIV><FONT face=Arial size=2><SPAN 

class=935384421-30042006></SPAN></FONT> </DIV>

<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>In any case, looking 

at the IPOIB code, there are two important fields: 1) recv_mgr-> depth. 2 ) 

port->state. once the port goes down the state changes to error, and on the 

function port up you have a busy loop that waits for all packets to complete. In 

the function post_recv (so far so good) the state field is checked (but not 

atomically). </SPAN></FONT></DIV>

<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>The problem starts 

when you change the port->state to be in RTS while the QP is still in ERROR 

and later RESET stages. There is a window in time that you can post recvs, but 

you can't be sure to get completion on.</SPAN></FONT></DIV>

<DIV><FONT face=Arial size=2><SPAN 

class=935384421-30042006></SPAN></FONT> </DIV>

<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>I have added some 

more debug prints and asserts that show this bug. You can reproduce it using the 

test in the end of this mail. My patch also contains a simple work around for 

the problem (the work around makes the window smaller, it doesn't kill it all 

together). To see only the work around look for the field DontPostRecvs in the 

code (only 4 lines or so).</SPAN></FONT></DIV>

<DIV><FONT face=Arial size=2><SPAN 

class=935384421-30042006></SPAN></FONT> </DIV>

<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>What do you 

think?</SPAN></FONT></DIV>

<DIV><FONT face=Arial size=2><SPAN 

class=935384421-30042006></SPAN></FONT> </DIV>

<DIV><FONT face=Arial size=2><SPAN 

class=935384421-30042006>Thanks</SPAN></FONT></DIV>

<DIV><FONT face=Arial size=2><SPAN 

class=935384421-30042006>Tzachi</SPAN></FONT></DIV>

<DIV><FONT face=Arial size=2><SPAN 

class=935384421-30042006></SPAN></FONT> </DIV>

<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>By the way, there 

are probably a few ways to fix this bug, and as a maintainer you will have to 

decide the best way, however here is an idea that I have to solve the bug 

without introducing new fields or taking an extra lock.</SPAN></FONT></DIV>

<DIV><FONT face=Arial size=2><SPAN 

class=935384421-30042006></SPAN></FONT> </DIV>

<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>currently the depth 

field tells the number of packets that are currently posted and is always a 

positive number. My change says that when the QP is active, use the same 

algorithm. When the QP is going to be destroyed, remove (atomically) 1,000,000 

from the depth field. You will reach a negative number for sure. This will mark 

that the QP is in closing process. Please note that if the number you get is 

bigger than -1,000,000 this means that there are posted recvs that were not yet 

pooled, and you have to keep pooling until you reach -1000,000. One more change 

that you should do in this case is before pooling do the atomic add of the 

number pooled. If after the add you get a negative number don't post the recevs 

(QP already started to close) and decrement the number just 

added.</SPAN></FONT></DIV>

<DIV><FONT face=Arial size=2><SPAN 

class=935384421-30042006></SPAN></FONT> </DIV>

<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>If you think that 

this idea is good I can send a patch for you.</SPAN></FONT></DIV>

<DIV><FONT face=Arial size=2><SPAN 

class=935384421-30042006></SPAN></FONT> </DIV>

<DIV><FONT face=Arial size=2><SPAN 

class=935384421-30042006></SPAN></FONT> </DIV>

<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>Here is the script 

that restarts open sm:</SPAN></FONT></DIV>

<DIV><FONT face=Arial size=2><SPAN 

class=935384421-30042006></SPAN></FONT> </DIV>

<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>D:\Documents and 

Settings\herod>type testsm.cmd<BR>:start<BR>start 

opensm<BR>q:\unixtools\sleep 3<BR>kill opensm<BR>q:\unixtools\sleep 1<BR>goto 

start</SPAN></FONT></DIV>

<DIV><FONT face=Arial size=2><SPAN 

class=935384421-30042006></SPAN></FONT> </DIV>

<DIV><FONT face=Arial size=2><SPAN 

class=935384421-30042006></SPAN></FONT> </DIV>

<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>The patch that 

explains the problem:</SPAN></FONT></DIV>

<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>Index: 

ipoib_port.c<BR>===================================================================<BR>--- 

ipoib_port.c (revision 331)<BR>+++ ipoib_port.c (working copy)<BR>@@ 

-35,7 +35,13 @@<BR> #include "ipoib_adapter.h"<BR> #include 

"ipoib_debug.h"<BR> <BR>+long g_CurrentlyPostedRecvs = 0;<BR>+long 

g_CurrentlyPolledEntries = 0;<BR> <BR>+long 

g_CurrentlyPostedFromLastResetReset = 0;<BR>+long 

g_CurrentlyPolledEntriesFromLastReset = 0;<BR>+<BR>+<BR> /*<BR>  * PR 

102492 relates to the SA not accepting wildcard values for 

MCMemberRecord<BR>  * Sets used to create/join Multicast groups.  

Defining this keyword causes all<BR>@@ -540,6 +546,7 

@@<BR>  IPOIB_ENTER( IPOIB_DBG_INIT 

);<BR> <BR>  p_port->state = 

IB_QPS_RESET;<BR>+    p_port->DontPostRecvs = 

TRUE;<BR> <BR>  cl_obj_construct( &p_port->obj, 

IPOIB_OBJ_PORT );<BR>  cl_spinlock_construct( 

&p_port->send_lock );<BR>@@ -643,6 +650,7 

@@<BR> }<BR> <BR> <BR>+<BR> static 

void<BR> __port_destroying(<BR>  IN    cl_obj_t* 

const    p_obj )<BR>@@ -676,7 +684,7 

@@<BR>  p_port = PARENT_STRUCT( p_obj, ipoib_port_t, obj 

);<BR> <BR>  /* Wait for all sends and receives to get flushed. 

*/<BR>- while( p_port->send_mgr.depth || p_port->recv_mgr.depth 

)<BR>+ while( p_port->send_mgr.depth || p_port->recv_mgr.depth1 

)<BR>   cl_thread_suspend( 0 );<BR> <BR>  /* 

Destroy the send and receive managers before closing the CA. */<BR>@@ -746,6 

+754,7 

@@<BR>  ib_qp_attr_t  qp_attr;<BR> <BR>  IPOIB_ENTER( 

IPOIB_DBG_INIT );<BR>+    IPOIB_TRACE_EXIT( 

IPOIB_DBG_ERROR,("__ib_mgr_init creating 

everything\n"));<BR> <BR>  /* Open the CA. 

*/<BR>  status = p_port->p_adapter->p_ifc->open_ca(<BR>@@ 

-1298,7 +1307,7 @@<BR>  IPOIB_ENTER( IPOIB_DBG_INIT 

);<BR> <BR>  CL_ASSERT( cl_is_qlist_empty( 

&p_port->recv_mgr.done_list ) );<BR>- CL_ASSERT( 

!p_port->recv_mgr.depth );<BR>+ CL_ASSERT( !p_port->recv_mgr.depth1 

);<BR> <BR>  if( p_port->recv_mgr.recv_pkt_array 

)<BR>   cl_free( p_port->recv_mgr.recv_pkt_array );<BR>@@ 

-1324,19 +1333,20 @@<BR>  CL_ASSERT( p_port 

);<BR>  CL_ASSERT( num_wrs );<BR>  cl_obj_lock( 

&p_port->obj );<BR>- if( p_port->state == IB_QPS_ERROR 

)<BR>+ if(( p_port->state == IB_QPS_ERROR ) || (p_port->DontPostRecvs 

== TRUE))<BR>  {<BR>   cl_obj_unlock( 

&p_port->obj );<BR>   IPOIB_TRACE_EXIT( 

IPOIB_DBG_RECV,<BR>    ("Port in invalid state.  Not 

reposting.\n") );<BR>   return 

IB_SUCCESS;<BR>  }<BR>+    ASSERT(p_port->state == 

IB_QPS_RTS);<BR>  cl_obj_ref( &p_port->obj 

);<BR>  cl_obj_unlock( &p_port->obj 

);<BR> <BR>  while( num_wrs-- 

)<BR>  {<BR>-  if( p_port->recv_mgr.depth == 

p_port->p_adapter->params.rq_depth )<BR>+  if( 

p_port->recv_mgr.depth1 == p_port->p_adapter->params.rq_depth 

)<BR>   {<BR>    IPOIB_TRACE( IPOIB_DBG_INFO 

| IPOIB_DBG_RECV,<BR>     ("Receive queue already 

full\n") );<BR>@@ -1366,7 +1376,7 @@<BR> <BR>   p_head = 

p_next;<BR> <BR>-  cl_atomic_inc( &p_port->recv_mgr.depth 

);<BR>+  cl_atomic_inc( &p_port->recv_mgr.depth1 

);<BR>  }<BR> <BR>  if( !p_head )<BR>@@ -1377,8 

+1387,27 @@<BR>  }<BR> <BR>  cl_perf_start( PostRecv 

);<BR>+    if( p_port->state == IB_QPS_ERROR ) 

{<BR>+        IPOIB_TRACE( 

IPOIB_DBG_ERROR,("****before Posting while in error"));<BR>+    

}<BR>+    {<BR>+        // 

Count the number of objects<BR>+        int i 

= 0;<BR>+        ib_recv_wr_t *pTemp = 

&p_head->wr;<BR>+        while (pTemp 

!= NULL) 

{<BR>+            

i++;<BR>+            

pTemp = pTemp->p_next;<BR>+        

}<BR>+        

InterlockedExchangeAdd(&g_CurrentlyPostedRecvs, 

i);<BR>+        

InterlockedExchangeAdd(&g_CurrentlyPostedFromLastResetReset, 

i);<BR>+<BR>+    }<BR>  status = 

p_port->p_adapter->p_ifc->post_recv(<BR>   p_port->ib_mgr.h_qp, 

&p_head->wr, &p_failed );<BR>+    if( p_port->state 

== IB_QPS_ERROR ) {<BR>+        IPOIB_TRACE( 

IPOIB_DBG_ERROR,("****After Posting while in error"));<BR>+    

}<BR>+<BR>  cl_perf_stop( &p_port->p_adapter->perf, PostRecv 

);<BR> <BR>  /*<BR>@@ -1397,7 +1426,8 

@@<BR>    p_failed = 

p_failed->p_next;<BR> <BR>    __buf_mgr_put_recv( 

p_port, p_head, NULL );<BR>-   cl_atomic_dec( 

&p_port->recv_mgr.depth );<BR>+   cl_atomic_dec( 

&p_port->recv_mgr.depth1 

);<BR>+            

CL_ASSERT(FALSE);<BR>   }<BR>  }<BR> <BR>@@ 

-1495,7 +1525,7 

@@<BR> {<BR>  ipoib_port_t  *p_port;<BR>  ib_api_status_t  status;<BR>- ib_wc_t    wc[MAX_RECV_WC], 

*p_free, *p_wc;<BR>+ ib_wc_t    wc[MAX_RECV_WC * 2], 

*p_free, *p_wc;<BR>  uint32_t   pkt_cnt, recv_cnt = 

0;<BR>  cl_qlist_t   done_list, 

bad_list;<BR>  size_t    i;<BR>@@ -1524,9 +1554,9 

@@<BR> <BR>  cl_obj_ref( &p_port->obj 

);<BR> <BR>- for( i = 0; i < MAX_RECV_WC; i++ )<BR>+ for( i = 

0; i < MAX_RECV_WC * 2; i++ ) //?????????? * 

2<BR>   wc[i].p_next = &wc[i + 1];<BR>- wc[MAX_RECV_WC - 

1].p_next = NULL;<BR>+ wc[MAX_RECV_WC * 2 - 1].p_next = NULL; //???????? 

X2<BR> <BR>  /*<BR>   * We'll be accessing the endpoint 

map so take a reference<BR>@@ -1558,7 +1588,7 @@<BR>  cl_atomic_dec( 

&p_port->endpt_rdr );<BR> <BR>  /* Update our posted 

depth. */<BR>- cl_atomic_sub( &p_port->recv_mgr.depth, recv_cnt 

);<BR>+ cl_atomic_sub( &p_port->recv_mgr.depth1, recv_cnt 

);<BR> <BR>  cl_perf_start( BuildPktArray );<BR>  /* 

Notify NDIS of any and all possible receive buffers. */<BR>@@ -1758,6 +1788,11 

@@<BR> <BR>  IPOIB_ENTER( IPOIB_DBG_RECV 

);<BR> <BR>+    for( p_wc = p_done_wc_list; p_wc; p_wc = 

p_wc->p_next ) {<BR>+        

InterlockedIncrement(&g_CurrentlyPolledEntries);<BR>+        

InterlockedIncrement(&g_CurrentlyPolledEntriesFromLastReset);<BR>+    

}<BR>+<BR>  for( p_wc = p_done_wc_list; p_wc; p_wc = p_wc->p_next 

)<BR>  {<BR>   CL_ASSERT( p_wc->wc_type == 

IB_WC_RECV );<BR>@@ -1772,6 +1807,7 

@@<BR>      ("Failed completion 

%s\n",<BR>      p_port->p_adapter->p_ifc->get_wc_status_str( 

p_wc->status )) );<BR>     ipoib_inc_recv_stat( 

p_port->p_adapter, IP_STAT_ERROR, 0 

);<BR>+                

ASSERT(FALSE);<BR>    }<BR>    else<BR>    {<BR>@@ 

-1780,7 +1816,9 

@@<BR>      p_port->p_adapter->p_ifc->get_wc_status_str( 

p_wc->status )) );<BR>     ipoib_inc_recv_stat( 

p_port->p_adapter, IP_STAT_DROPPED, 0 

);<BR>    }<BR>-   cl_qlist_insert_tail( 

p_bad_list, &p_desc->item.list_item 

);<BR>+//            if 

(p_bad_list->count < 0x80) {<BR>+    

   cl_qlist_insert_tail( p_bad_list, 

&p_desc->item.list_item 

);<BR>+//            

}<BR>    /* Dereference the port object on behalf of the 

failed receive. */<BR>    cl_obj_deref( &p_port->obj 

);<BR>    continue;<BR>@@ -4419,10 +4457,23 

@@<BR>  IPOIB_ENTER( IPOIB_DBG_INIT );<BR> <BR>  /* 

Wait for all receives to get flushed. */<BR>- while( 

p_port->recv_mgr.depth )<BR>+ while( p_port->recv_mgr.depth1 

)<BR>   cl_thread_suspend( 0 );<BR> <BR>+    

g_CurrentlyPolledEntriesFromLastReset = 0;<BR>+    

g_CurrentlyPostedFromLastResetReset = 0;<BR>+/*<BR>+    

__ib_mgr_destroy(p_port);<BR>+<BR>+ __recv_mgr_destroy( p_port 

);<BR>+// __send_mgr_destroy( p_port );<BR>+ __buf_mgr_destroy( p_port 

);<BR>+<BR>+<BR>+    

__port_init_short(p_port);<BR>+*/<BR>  p_port->state = 

IB_QPS_RTS;<BR>+    ASSERT(p_port->DontPostRecvs == 

TRUE);<BR> <BR>  info.method = 

IB_MAD_METHOD_GET;<BR>  info.attr_id = 

IB_MAD_ATTR_PORTINFO_RECORD;<BR>@@ -4736,6 +4787,9 

@@<BR>  ib_api_status_t  status;<BR>  ib_mcast_req_t  mcast_req;<BR> <BR>+<BR>+    

IPOIB_TRACE( IPOIB_DBG_ERROR,("__port_join_bcast called 

\n"));<BR>+<BR>  IPOIB_ENTER( IPOIB_DBG_MCAST 

);<BR> <BR>  /* Check that the rate is realizable for our port. 

*/<BR>@@ -4802,6 +4856,8 

@@<BR>  ib_api_status_t  status;<BR>  ib_mcast_req_t  mcast_req;<BR> <BR>+    

IPOIB_TRACE( IPOIB_DBG_ERROR,("__port_create_bcast called 

\n"));<BR>+<BR>  IPOIB_ENTER( IPOIB_DBG_MCAST 

);<BR> <BR>  /* Join the broadcast group. */<BR>@@ -4873,6 

+4929,7 @@<BR>  /* Mark our state.  This causes all callbacks to 

abort. */<BR>  cl_obj_lock( &p_port->obj 

);<BR>  p_port->state = IB_QPS_ERROR;<BR>+    

p_port->DontPostRecvs = 

TRUE;;<BR> <BR>  NdisWriteErrorLogEntry( 

p_port->p_adapter->h_adapter,<BR>   EVENT_IPOIB_PORT_DOWN, 

0 );<BR>@@ -4891,6 +4948,7 @@<BR>  CL_ASSERT( p_port->ib_mgr.h_qp 

);<BR>  cl_memclr( &qp_mod, sizeof(ib_qp_mod_t) 

);<BR>  qp_mod.req_state = IB_QPS_ERROR;<BR>+    

IPOIB_TRACE( IPOIB_DBG_ERROR,("Before changing to error, currently posted %d\n", 

p_port->recv_mgr.depth1));<BR>  status = 

p_port->p_adapter->p_ifc->modify_qp( p_port->ib_mgr.h_qp, 

&qp_mod );<BR>  if( status != IB_SUCCESS )<BR>  {<BR>@@ 

-4900,6 +4958,7 @@<BR>   p_port->p_adapter->hung = 

TRUE;<BR>   return;<BR>  }<BR>+    

IPOIB_TRACE( IPOIB_DBG_ERROR,("After changing to error, currently posted %d\n", 

p_port->recv_mgr.depth1));<BR> <BR>  /* Reset all endpoints so 

we don't flush our ARP cache. */<BR>  __endpt_mgr_reset_all( p_port 

);<BR>@@ -4981,6 +5040,8 

@@<BR>   return;<BR>  }<BR> <BR>+<BR>+<BR>  /* 

Get the QP ready for action. */<BR>  status = __ib_mgr_activate( 

p_port );<BR>  if( status != IB_SUCCESS )<BR>@@ -4996,6 +5057,8 

@@<BR>  }<BR> <BR>  /* Notify the adapter that we now 

have an active connection. */<BR>+<BR>+    

<BR>  ipoib_set_active( p_port->p_adapter 

);<BR> <BR>  cl_obj_deref( &p_port->obj );<BR>@@ -5039,6 

+5102,11 @@<BR>   * Move the QP to RESET.  This allows us to 

reclaim any<BR>   * unflushed receives.<BR>   

*/<BR>+<BR>+    ASSERT(g_CurrentlyPostedFromLastResetReset == 

0);<BR>+    ASSERT(g_CurrentlyPolledEntriesFromLastReset == 

0);<BR>+    ASSERT( p_port->recv_mgr.depth1 == 0 

);<BR>+   <BR>  cl_memclr( &qp_mod, sizeof(ib_qp_mod_t) 

);<BR>  qp_mod.req_state = IB_QPS_RESET;<BR>  status = 

p_port->p_adapter->p_ifc->modify_qp( p_port->ib_mgr.h_qp, 

&qp_mod );<BR>@@ -5083,6 +5151,7 @@<BR> <BR>  /* Prepost 

receives. */<BR>  cl_spinlock_acquire( &p_port->recv_lock 

);<BR>+    p_port->DontPostRecvs = 

FALSE;<BR>  __recv_mgr_repost( p_port, 

p_port->p_adapter->params.rq_depth );<BR>  cl_spinlock_release( 

&p_port->recv_lock );<BR> <BR>Index: 

ipoib_port.h<BR>===================================================================<BR>--- 

ipoib_port.h (revision 331)<BR>+++ ipoib_port.h (working copy)<BR>@@ 

-406,7 +406,7 @@<BR> <BR> typedef struct 

_ipoib_recv_mgr<BR> {<BR>- atomic32_t  depth;<BR>+ atomic32_t  depth1;<BR> <BR>  NDIS_PACKET  **recv_pkt_array;<BR> <BR>@@ 

-475,6 +475,7 

@@<BR>  cl_obj_rel_t   rel;<BR> <BR>  ib_qp_state_t   state;<BR>+    

BOOLEAN                    

DontPostRecvs;<BR> <BR>  cl_spinlock_t   recv_lock;<BR>  cl_spinlock_t   send_lock;<BR></SPAN></FONT></DIV></BODY></HTML>