<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=us-ascii">
<META content="MSHTML 6.00.2900.2802" name=GENERATOR></HEAD>
<BODY>
<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>(I am working with
memfree cards; however, this race probably also exists on cards with memory. Due
to another bug, one notices it only on memfree cards.)</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN
class=935384421-30042006></SPAN></FONT> </DIV>
<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>Hi
Fab,</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN
class=935384421-30042006></SPAN></FONT> </DIV>
<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>While debugging
IPoIB flows that involve stopping and starting OpenSM, we came to the conclusion
that there is a window of time in which recv packets should not be posted, yet
they are.</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN
class=935384421-30042006></SPAN></FONT> </DIV>
<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>Before I start the
explanation, I want to make sure that we all understand the different steps
of destroying a QP (I guess we all know this, but just in
case).</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>In the first step
the QP is in RTS; one can post recvs and get good completions.</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>When one wants to
"close" the QP, one has to move it to the ERROR state. All posted recvs
will be flushed, and newly posted recvs will eventually be flushed as well (there
is no way to know exactly when, but probably within a microsecond or so).</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>Later the QP is
brought to RESET; at this stage, posted recv packets are simply lost.
</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>Next the QP is
moved back to the INIT state.</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>There is a point
people might forget: once a QP is in the ERROR state, the driver is still polling
for completions, so if one changes the state too quickly, one cannot tell whether
all of them have arrived.</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN
class=935384421-30042006></SPAN></FONT> </DIV>
<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>In any case, looking
at the IPoIB code, there are two important fields: 1) recv_mgr->depth and 2)
port->state. Once the port goes down, the state changes to ERROR, and the
port-up function has a busy loop that waits for all packets to complete. In
the function post_recv (so far so good) the state field is checked (but not
atomically). </SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>The problem starts
when you change port->state to RTS while the QP is still in the ERROR,
and later the RESET, state. There is a window of time in which you can post
recvs but cannot be sure you will get completions for them.</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN
class=935384421-30042006></SPAN></FONT> </DIV>
<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>I have added some
more debug prints and asserts that demonstrate this bug. You can reproduce it
using the test at the end of this mail. My patch also contains a simple workaround
for the problem (the workaround makes the window smaller; it does not eliminate
it altogether). To see only the workaround, look for the field DontPostRecvs in
the code (only 4 lines or so).</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN
class=935384421-30042006></SPAN></FONT> </DIV>
<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>What do you
think?</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN
class=935384421-30042006></SPAN></FONT> </DIV>
<DIV><FONT face=Arial size=2><SPAN
class=935384421-30042006>Thanks</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN
class=935384421-30042006>Tzachi</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN
class=935384421-30042006></SPAN></FONT> </DIV>
<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>By the way, there
are probably a few ways to fix this bug, and as the maintainer you will have to
decide on the best one; however, here is an idea I have for solving the bug
without introducing new fields or taking an extra lock.</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN
class=935384421-30042006></SPAN></FONT> </DIV>
<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>Currently the depth
field holds the number of packets that are currently posted and is always a
non-negative number. My change: while the QP is active, use the same algorithm.
When the QP is about to be destroyed, atomically subtract 1,000,000 from the
depth field; you are guaranteed to reach a negative number, which marks
that the QP is in the process of closing. Please note that if the number you get is
greater than -1,000,000, there are posted recvs that have not yet been
polled, and you have to keep polling until you reach -1,000,000. One more change
you should make in this case: before posting, atomically add the
number of recvs to be posted. If after the add you get a negative number, do not post the recvs
(the QP has already started to close) and subtract the number you just
added.</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN
class=935384421-30042006></SPAN></FONT> </DIV>
<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>If you think this
idea is good, I can send you a patch.</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN
class=935384421-30042006></SPAN></FONT> </DIV>
<DIV><FONT face=Arial size=2><SPAN
class=935384421-30042006></SPAN></FONT> </DIV>
<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>Here is the script
that restarts OpenSM:</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN
class=935384421-30042006></SPAN></FONT> </DIV>
<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>D:\Documents and
Settings\herod>type testsm.cmd<BR>:start<BR>start
opensm<BR>q:\unixtools\sleep 3<BR>kill opensm<BR>q:\unixtools\sleep 1<BR>goto
start</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN
class=935384421-30042006></SPAN></FONT> </DIV>
<DIV><FONT face=Arial size=2><SPAN
class=935384421-30042006></SPAN></FONT> </DIV>
<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>Here is the patch
that demonstrates the problem:</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN class=935384421-30042006>Index:
ipoib_port.c<BR>===================================================================<BR>---
ipoib_port.c (revision 331)<BR>+++ ipoib_port.c (working copy)<BR>@@
-35,7 +35,13 @@<BR> #include "ipoib_adapter.h"<BR> #include
"ipoib_debug.h"<BR> <BR>+long g_CurrentlyPostedRecvs = 0;<BR>+long
g_CurrentlyPolledEntries = 0;<BR> <BR>+long
g_CurrentlyPostedFromLastResetReset = 0;<BR>+long
g_CurrentlyPolledEntriesFromLastReset = 0;<BR>+<BR>+<BR> /*<BR> * PR
102492 relates to the SA not accepting wildcard values for
MCMemberRecord<BR> * Sets used to create/join Multicast groups.
Defining this keyword causes all<BR>@@ -540,6 +546,7
@@<BR> IPOIB_ENTER( IPOIB_DBG_INIT
);<BR> <BR> p_port->state =
IB_QPS_RESET;<BR>+ p_port->DontPostRecvs =
TRUE;<BR> <BR> cl_obj_construct( &p_port->obj,
IPOIB_OBJ_PORT );<BR> cl_spinlock_construct(
&p_port->send_lock );<BR>@@ -643,6 +650,7
@@<BR> }<BR> <BR> <BR>+<BR> static
void<BR> __port_destroying(<BR> IN cl_obj_t*
const p_obj )<BR>@@ -676,7 +684,7
@@<BR> p_port = PARENT_STRUCT( p_obj, ipoib_port_t, obj
);<BR> <BR> /* Wait for all sends and receives to get flushed.
*/<BR>- while( p_port->send_mgr.depth || p_port->recv_mgr.depth
)<BR>+ while( p_port->send_mgr.depth || p_port->recv_mgr.depth1
)<BR> cl_thread_suspend( 0 );<BR> <BR> /*
Destroy the send and receive managers before closing the CA. */<BR>@@ -746,6
+754,7
@@<BR> ib_qp_attr_t qp_attr;<BR> <BR> IPOIB_ENTER(
IPOIB_DBG_INIT );<BR>+ IPOIB_TRACE_EXIT(
IPOIB_DBG_ERROR,("__ib_mgr_init creating
everything\n"));<BR> <BR> /* Open the CA.
*/<BR> status = p_port->p_adapter->p_ifc->open_ca(<BR>@@
-1298,7 +1307,7 @@<BR> IPOIB_ENTER( IPOIB_DBG_INIT
);<BR> <BR> CL_ASSERT( cl_is_qlist_empty(
&p_port->recv_mgr.done_list ) );<BR>- CL_ASSERT(
!p_port->recv_mgr.depth );<BR>+ CL_ASSERT( !p_port->recv_mgr.depth1
);<BR> <BR> if( p_port->recv_mgr.recv_pkt_array
)<BR> cl_free( p_port->recv_mgr.recv_pkt_array );<BR>@@
-1324,19 +1333,20 @@<BR> CL_ASSERT( p_port
);<BR> CL_ASSERT( num_wrs );<BR> cl_obj_lock(
&p_port->obj );<BR>- if( p_port->state == IB_QPS_ERROR
)<BR>+ if(( p_port->state == IB_QPS_ERROR ) || (p_port->DontPostRecvs
== TRUE))<BR> {<BR> cl_obj_unlock(
&p_port->obj );<BR> IPOIB_TRACE_EXIT(
IPOIB_DBG_RECV,<BR> ("Port in invalid state. Not
reposting.\n") );<BR> return
IB_SUCCESS;<BR> }<BR>+ ASSERT(p_port->state ==
IB_QPS_RTS);<BR> cl_obj_ref( &p_port->obj
);<BR> cl_obj_unlock( &p_port->obj
);<BR> <BR> while( num_wrs--
)<BR> {<BR>- if( p_port->recv_mgr.depth ==
p_port->p_adapter->params.rq_depth )<BR>+ if(
p_port->recv_mgr.depth1 == p_port->p_adapter->params.rq_depth
)<BR> {<BR> IPOIB_TRACE( IPOIB_DBG_INFO
| IPOIB_DBG_RECV,<BR> ("Receive queue already
full\n") );<BR>@@ -1366,7 +1376,7 @@<BR> <BR> p_head =
p_next;<BR> <BR>- cl_atomic_inc( &p_port->recv_mgr.depth
);<BR>+ cl_atomic_inc( &p_port->recv_mgr.depth1
);<BR> }<BR> <BR> if( !p_head )<BR>@@ -1377,8
+1387,27 @@<BR> }<BR> <BR> cl_perf_start( PostRecv
);<BR>+ if( p_port->state == IB_QPS_ERROR )
{<BR>+ IPOIB_TRACE(
IPOIB_DBG_ERROR,("****before Posting while in error"));<BR>+
}<BR>+ {<BR>+ //
Count the number of objects<BR>+ int i
= 0;<BR>+ ib_recv_wr_t *pTemp =
&p_head->wr;<BR>+ while (pTemp
!= NULL)
{<BR>+
i++;<BR>+
pTemp = pTemp->p_next;<BR>+
}<BR>+
InterlockedExchangeAdd(&g_CurrentlyPostedRecvs,
i);<BR>+
InterlockedExchangeAdd(&g_CurrentlyPostedFromLastResetReset,
i);<BR>+<BR>+ }<BR> status =
p_port->p_adapter->p_ifc->post_recv(<BR> p_port->ib_mgr.h_qp,
&p_head->wr, &p_failed );<BR>+ if( p_port->state
== IB_QPS_ERROR ) {<BR>+ IPOIB_TRACE(
IPOIB_DBG_ERROR,("****After Posting while in error"));<BR>+
}<BR>+<BR> cl_perf_stop( &p_port->p_adapter->perf, PostRecv
);<BR> <BR> /*<BR>@@ -1397,7 +1426,8
@@<BR> p_failed =
p_failed->p_next;<BR> <BR> __buf_mgr_put_recv(
p_port, p_head, NULL );<BR>- cl_atomic_dec(
&p_port->recv_mgr.depth );<BR>+ cl_atomic_dec(
&p_port->recv_mgr.depth1
);<BR>+
CL_ASSERT(FALSE);<BR> }<BR> }<BR> <BR>@@
-1495,7 +1525,7
@@<BR> {<BR> ipoib_port_t *p_port;<BR> ib_api_status_t status;<BR>- ib_wc_t wc[MAX_RECV_WC],
*p_free, *p_wc;<BR>+ ib_wc_t wc[MAX_RECV_WC * 2],
*p_free, *p_wc;<BR> uint32_t pkt_cnt, recv_cnt =
0;<BR> cl_qlist_t done_list,
bad_list;<BR> size_t i;<BR>@@ -1524,9 +1554,9
@@<BR> <BR> cl_obj_ref( &p_port->obj
);<BR> <BR>- for( i = 0; i < MAX_RECV_WC; i++ )<BR>+ for( i =
0; i < MAX_RECV_WC * 2; i++ ) //?????????? *
2<BR> wc[i].p_next = &wc[i + 1];<BR>- wc[MAX_RECV_WC -
1].p_next = NULL;<BR>+ wc[MAX_RECV_WC * 2 - 1].p_next = NULL; //????????
X2<BR> <BR> /*<BR> * We'll be accessing the endpoint
map so take a reference<BR>@@ -1558,7 +1588,7 @@<BR> cl_atomic_dec(
&p_port->endpt_rdr );<BR> <BR> /* Update our posted
depth. */<BR>- cl_atomic_sub( &p_port->recv_mgr.depth, recv_cnt
);<BR>+ cl_atomic_sub( &p_port->recv_mgr.depth1, recv_cnt
);<BR> <BR> cl_perf_start( BuildPktArray );<BR> /*
Notify NDIS of any and all possible receive buffers. */<BR>@@ -1758,6 +1788,11
@@<BR> <BR> IPOIB_ENTER( IPOIB_DBG_RECV
);<BR> <BR>+ for( p_wc = p_done_wc_list; p_wc; p_wc =
p_wc->p_next ) {<BR>+
InterlockedIncrement(&g_CurrentlyPolledEntries);<BR>+
InterlockedIncrement(&g_CurrentlyPolledEntriesFromLastReset);<BR>+
}<BR>+<BR> for( p_wc = p_done_wc_list; p_wc; p_wc = p_wc->p_next
)<BR> {<BR> CL_ASSERT( p_wc->wc_type ==
IB_WC_RECV );<BR>@@ -1772,6 +1807,7
@@<BR> ("Failed completion
%s\n",<BR> p_port->p_adapter->p_ifc->get_wc_status_str(
p_wc->status )) );<BR> ipoib_inc_recv_stat(
p_port->p_adapter, IP_STAT_ERROR, 0
);<BR>+
ASSERT(FALSE);<BR> }<BR> else<BR> {<BR>@@
-1780,7 +1816,9
@@<BR> p_port->p_adapter->p_ifc->get_wc_status_str(
p_wc->status )) );<BR> ipoib_inc_recv_stat(
p_port->p_adapter, IP_STAT_DROPPED, 0
);<BR> }<BR>- cl_qlist_insert_tail(
p_bad_list, &p_desc->item.list_item
);<BR>+// if
(p_bad_list->count < 0x80) {<BR>+
cl_qlist_insert_tail( p_bad_list,
&p_desc->item.list_item
);<BR>+//
}<BR> /* Dereference the port object on behalf of the
failed receive. */<BR> cl_obj_deref( &p_port->obj
);<BR> continue;<BR>@@ -4419,10 +4457,23
@@<BR> IPOIB_ENTER( IPOIB_DBG_INIT );<BR> <BR> /*
Wait for all receives to get flushed. */<BR>- while(
p_port->recv_mgr.depth )<BR>+ while( p_port->recv_mgr.depth1
)<BR> cl_thread_suspend( 0 );<BR> <BR>+
g_CurrentlyPolledEntriesFromLastReset = 0;<BR>+
g_CurrentlyPostedFromLastResetReset = 0;<BR>+/*<BR>+
__ib_mgr_destroy(p_port);<BR>+<BR>+ __recv_mgr_destroy( p_port
);<BR>+// __send_mgr_destroy( p_port );<BR>+ __buf_mgr_destroy( p_port
);<BR>+<BR>+<BR>+
__port_init_short(p_port);<BR>+*/<BR> p_port->state =
IB_QPS_RTS;<BR>+ ASSERT(p_port->DontPostRecvs ==
TRUE);<BR> <BR> info.method =
IB_MAD_METHOD_GET;<BR> info.attr_id =
IB_MAD_ATTR_PORTINFO_RECORD;<BR>@@ -4736,6 +4787,9
@@<BR> ib_api_status_t status;<BR> ib_mcast_req_t mcast_req;<BR> <BR>+<BR>+
IPOIB_TRACE( IPOIB_DBG_ERROR,("__port_join_bcast called
\n"));<BR>+<BR> IPOIB_ENTER( IPOIB_DBG_MCAST
);<BR> <BR> /* Check that the rate is realizable for our port.
*/<BR>@@ -4802,6 +4856,8
@@<BR> ib_api_status_t status;<BR> ib_mcast_req_t mcast_req;<BR> <BR>+
IPOIB_TRACE( IPOIB_DBG_ERROR,("__port_create_bcast called
\n"));<BR>+<BR> IPOIB_ENTER( IPOIB_DBG_MCAST
);<BR> <BR> /* Join the broadcast group. */<BR>@@ -4873,6
+4929,7 @@<BR> /* Mark our state. This causes all callbacks to
abort. */<BR> cl_obj_lock( &p_port->obj
);<BR> p_port->state = IB_QPS_ERROR;<BR>+
p_port->DontPostRecvs =
TRUE;;<BR> <BR> NdisWriteErrorLogEntry(
p_port->p_adapter->h_adapter,<BR> EVENT_IPOIB_PORT_DOWN,
0 );<BR>@@ -4891,6 +4948,7 @@<BR> CL_ASSERT( p_port->ib_mgr.h_qp
);<BR> cl_memclr( &qp_mod, sizeof(ib_qp_mod_t)
);<BR> qp_mod.req_state = IB_QPS_ERROR;<BR>+
IPOIB_TRACE( IPOIB_DBG_ERROR,("Before changing to error, currently posted %d\n",
p_port->recv_mgr.depth1));<BR> status =
p_port->p_adapter->p_ifc->modify_qp( p_port->ib_mgr.h_qp,
&qp_mod );<BR> if( status != IB_SUCCESS )<BR> {<BR>@@
-4900,6 +4958,7 @@<BR> p_port->p_adapter->hung =
TRUE;<BR> return;<BR> }<BR>+
IPOIB_TRACE( IPOIB_DBG_ERROR,("After changing to error, currently posted %d\n",
p_port->recv_mgr.depth1));<BR> <BR> /* Reset all endpoints so
we don't flush our ARP cache. */<BR> __endpt_mgr_reset_all( p_port
);<BR>@@ -4981,6 +5040,8
@@<BR> return;<BR> }<BR> <BR>+<BR>+<BR> /*
Get the QP ready for action. */<BR> status = __ib_mgr_activate(
p_port );<BR> if( status != IB_SUCCESS )<BR>@@ -4996,6 +5057,8
@@<BR> }<BR> <BR> /* Notify the adapter that we now
have an active connection. */<BR>+<BR>+
<BR> ipoib_set_active( p_port->p_adapter
);<BR> <BR> cl_obj_deref( &p_port->obj );<BR>@@ -5039,6
+5102,11 @@<BR> * Move the QP to RESET. This allows us to
reclaim any<BR> * unflushed receives.<BR>
*/<BR>+<BR>+ ASSERT(g_CurrentlyPostedFromLastResetReset ==
0);<BR>+ ASSERT(g_CurrentlyPolledEntriesFromLastReset ==
0);<BR>+ ASSERT( p_port->recv_mgr.depth1 == 0
);<BR>+ <BR> cl_memclr( &qp_mod, sizeof(ib_qp_mod_t)
);<BR> qp_mod.req_state = IB_QPS_RESET;<BR> status =
p_port->p_adapter->p_ifc->modify_qp( p_port->ib_mgr.h_qp,
&qp_mod );<BR>@@ -5083,6 +5151,7 @@<BR> <BR> /* Prepost
receives. */<BR> cl_spinlock_acquire( &p_port->recv_lock
);<BR>+ p_port->DontPostRecvs =
FALSE;<BR> __recv_mgr_repost( p_port,
p_port->p_adapter->params.rq_depth );<BR> cl_spinlock_release(
&p_port->recv_lock );<BR> <BR>Index:
ipoib_port.h<BR>===================================================================<BR>---
ipoib_port.h (revision 331)<BR>+++ ipoib_port.h (working copy)<BR>@@
-406,7 +406,7 @@<BR> <BR> typedef struct
_ipoib_recv_mgr<BR> {<BR>- atomic32_t depth;<BR>+ atomic32_t depth1;<BR> <BR> NDIS_PACKET **recv_pkt_array;<BR> <BR>@@
-475,6 +475,7
@@<BR> cl_obj_rel_t rel;<BR> <BR> ib_qp_state_t state;<BR>+
BOOLEAN
DontPostRecvs;<BR> <BR> cl_spinlock_t recv_lock;<BR> cl_spinlock_t send_lock;<BR></SPAN></FONT></DIV></BODY></HTML>