[ofw][patches][IBAL] work around for reference count leakage bugs

Leonid Keller leonid at mellanox.co.il
Mon Jun 1 09:30:55 PDT 2009


No, it's not a new problem.
From time to time we get reports of, and recently saw ourselves (during
WHQL runs), situations where IBAL gets stuck in sync_destroy_obj, waiting
endlessly for the ref_cnt of an object to be released.
Usually this is connected to some unreleased MADs.
These situations are usually hard to reproduce, so we continue to live
with these unresolved bugs.
In the checked version IBAL doesn't wait endlessly: after some timeout it
forces object destruction.
This patch makes IBAL behave the same way in the free version as well,
reporting the problem to the EventLog.
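
To illustrate the idea outside the driver, here is a minimal standalone sketch of a bounded wait followed by forced destruction. It is not the IBAL code: demo_obj_t, demo_sync_destroy and the two wait-step helpers are hypothetical names invented for this example, the timeout is modeled as a simple step count, and fprintf stands in for the Event Log call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for an AL object; NOT the real al_obj_t. */
typedef struct {
    const char *type_name;  /* used only for the log message */
    int         ref_cnt;    /* outstanding references */
} demo_obj_t;

/* Hypothetical wait primitive: one bounded wait interval.
 * In a real driver this would be an event wait with a timeout. */
typedef void (*demo_wait_step_fn)(demo_obj_t *obj);

/* Simulates a well-behaved holder dropping one reference per interval. */
void demo_release_one(demo_obj_t *obj)
{
    if (obj->ref_cnt > 0)
        obj->ref_cnt--;
}

/* Simulates a leaked reference: nothing is ever released. */
void demo_never_release(demo_obj_t *obj)
{
    (void)obj;
}

/* Waits at most max_steps intervals for ref_cnt to reach zero.
 * Returns true if the references went away on their own,
 * false if the wait timed out and destruction was forced. */
bool demo_sync_destroy(demo_obj_t *obj, int max_steps,
                       demo_wait_step_fn wait_step)
{
    assert(obj != NULL && wait_step != NULL);

    for (int i = 0; i < max_steps; i++) {
        if (obj->ref_cnt == 0)
            return true;        /* clean release */
        wait_step(obj);         /* wait one more interval */
    }
    if (obj->ref_cnt == 0)
        return true;            /* released during the last interval */

    /* Timed out: report the leak, then proceed instead of hanging. */
    fprintf(stderr,
            "stuck: object %s, ref_cnt: %d. Forcing object destruction.\n",
            obj->type_name, obj->ref_cnt);
    obj->ref_cnt = 0;
    return false;
}
```

Calling demo_sync_destroy with demo_release_one returns true once the count drains; with demo_never_release it logs the leak and returns false after max_steps intervals. The trade-off is the same as in the patch: the leak is reported rather than fixed, but disable and shutdown are no longer blocked by it.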
 


________________________________

	From: Smith, Stan [mailto:stan.smith at intel.com] 
	Sent: Monday, June 01, 2009 7:06 PM
	To: Leonid Keller
	Cc: ofw at lists.openfabrics.org
	Subject: RE: [ofw][patches][IBAL] work around for reference count leakage bugs
	
	
	Hello Leo,
	  At what svn revision did you first start seeing the refcnt leakage?
	 
	Specifically, at svn.2221 (mthca.sys) I do not see refcnt leakage ASSERTs/problems on HCA disable or system shutdown.
	 
	Any ideas as to what has changed?
	 
	thanks,
	 
	stan.

________________________________

	From: ofw-bounces at lists.openfabrics.org
[mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Leonid Keller
	Sent: Monday, June 01, 2009 8:41 AM
	To: ofw at lists.openfabrics.org
	Subject: [ofw][patches][IBAL] work around for reference count leakage bugs
	
	
	IBAL still has bugs that cause reference count leakage, which stops the cascading destruction of IBAL resources.
	This in turn causes IBBUS to freeze on HCA disable or system power down.
	On checked builds IBAL forces destruction of the objects after some timeout.
	On free builds it waits endlessly.
	This patch makes the free version behave the same as the checked version, while also sending a message to the System Event Log.
	 
	 
	 
	Index: V:/svn/winib/trunk/core/al/al_common.c
	===================================================================
	--- V:/svn/winib/trunk/core/al/al_common.c (revision 4403)
	+++ V:/svn/winib/trunk/core/al/al_common.c (revision 4404)
	@@ -35,6 +35,7 @@
	 #include "al_ci_ca.h"
	 #include "al_common.h"
	 #include "al_debug.h"
	+#include "al_ca.h"
	 
	 #if defined(EVENT_TRACING)
	 #ifdef offsetof
	@@ -46,6 +47,7 @@
	 #include "al_mgr.h"
	 #include <complib/cl_math.h>
	 #include "ib_common.h"
	+#include "bus_ev_log.h"
	 
	 
	 
	@@ -498,7 +500,6 @@
	 
	  if( deref_al_obj( p_obj ) )
	  {
	- #ifdef _DEBUG_
	   uint32_t  wait_us;
	   /*
	    * Wait for all other references to go away.  We wait as long as the
	@@ -529,13 +530,11 @@
	      &p_obj->event, AL_MAX_TIMEOUT_US, AL_WAIT_ALERTABLE );
	    } while( cl_status == CL_NOT_DONE );
	   }
	- #else
	-  do
	-  {
	-   cl_status = cl_event_wait_on(
	-    &p_obj->event, EVENT_NO_TIMEOUT, AL_WAIT_ALERTABLE );
	-  } while( cl_status == CL_NOT_DONE );
	- #endif
	+  if ( p_obj->p_ci_ca && p_obj->p_ci_ca->h_ca )
	+   CL_PRINT_TO_EVENT_LOG( p_obj->p_ci_ca->h_ca->p_fdo, EVENT_IBBUS_ANY_ERROR,
	+    ("IBAL stuck: AL object %s, ref_cnt: %d. Forcing object destruction.\n",
	+    ib_get_obj_type( p_obj ), p_obj->ref_cnt));
	+
	   CL_ASSERT( cl_status == CL_SUCCESS );
	   if( cl_status != CL_SUCCESS )
	   {
	
