[ofw][patches][IBAL] work around for reference count leakage bugs
Leonid Keller
leonid at mellanox.co.il
Mon Jun 1 09:30:55 PDT 2009
No, no, it's not a new problem.
It has been reported to us from time to time, and we also saw it recently
(during WHQL runs): IBAL gets stuck in sync_destroy_obj, waiting
endlessly for the ref_cnt of an object to be released.
Usually it is connected to some unreleased MADs.
These situations are usually hard to reproduce, so we keep living
with these unfixed bugs.
In the checked build IBAL doesn't wait endlessly: after a timeout it
forces object destruction.
This patch makes IBAL behave the same way in the free build as well,
reporting the problem to the Event Log.
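To illustrate the difference between the two builds, here is a minimal C model of a bounded wait that gives up after a timeout, reports the leak and forces destruction instead of hanging. It is a sketch under stated assumptions, not IBAL code: `model_obj_t`, `simulate_tick`, `wait_for_release` and `destroy_obj` are hypothetical names, the timeout is counted in simulated ticks rather than microseconds (the patch itself uses `AL_MAX_TIMEOUT_US`), and `fprintf(stderr, ...)` stands in for `CL_PRINT_TO_EVENT_LOG`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model object (not IBAL's al_obj_t).
 * ref_cnt counts outstanding references; will_release counts how many of
 * them their holders will actually drop. A leaked reference is never dropped. */
typedef struct {
    const char *type;
    int ref_cnt;
    int will_release;
} model_obj_t;

/* Simulates one time slice in which a well-behaved holder drops its reference. */
static void simulate_tick(model_obj_t *obj)
{
    if (obj->will_release > 0) {
        obj->will_release--;
        obj->ref_cnt--;
    }
}

/* Bounded wait: give the other references max_ticks slices to go away.
 * Returns true if ref_cnt reached zero within the timeout. */
static bool wait_for_release(model_obj_t *obj, int max_ticks)
{
    for (int t = 0; t < max_ticks && obj->ref_cnt > 0; t++)
        simulate_tick(obj);
    return obj->ref_cnt == 0;
}

/* Destroys obj. Instead of waiting forever on leaked references, it reports
 * the problem and forces destruction. Returns true if destruction was forced. */
static bool destroy_obj(model_obj_t *obj, int max_ticks)
{
    if (wait_for_release(obj, max_ticks))
        return false;
    /* Stand-in for the event-log report (hypothetical; the patch uses CL_PRINT_TO_EVENT_LOG). */
    fprintf(stderr, "stuck: object %s, ref_cnt: %d. Forcing object destruction.\n",
            obj->type, obj->ref_cnt);
    obj->ref_cnt = 0; /* force: proceed as if all references were gone */
    return true;
}
```

With `{ "QP", 3, 3 }` all references are dropped within the timeout and `destroy_obj` returns `false`; with `{ "MAD", 3, 2 }` one reference leaks, so the function logs the problem and returns `true` instead of blocking.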
________________________________
From: Smith, Stan [mailto:stan.smith at intel.com]
Sent: Monday, June 01, 2009 7:06 PM
To: Leonid Keller
Cc: ofw at lists.openfabrics.org
Subject: RE: [ofw][patches][IBAL] work around for reference count leakage bugs
Hello Leo,
At what svn revision did you first start seeing the refcnt
leakage?
Specifically, at svn.2221 (mthca.sys) I do not see any refcnt leakage
ASSERTs/problems on HCA disable or system shutdown.
Any ideas as to what has changed?
thanks,
stan.
________________________________
From: ofw-bounces at lists.openfabrics.org
[mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Leonid Keller
Sent: Monday, June 01, 2009 8:41 AM
To: ofw at lists.openfabrics.org
Subject: [ofw][patches][IBAL] work around for reference count leakage bugs
IBAL still has bugs that cause reference count leakage, which
stops the cascading destruction of IBAL resources.
This in turn causes IBBUS to freeze on HCA disable or system
power-down.
On checked builds IBAL forces destruction of the objects after
a timeout.
On the free build it waits endlessly.
This patch makes the free build behave the same as the checked
build, while also sending a message to the System Event Log.
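How a single leaked reference stalls the whole cascade can be shown with a small C model in which every child object holds one reference on its parent. The `node_t` type, `node_init`, `node_deref` and the `leaks_parent_ref` flag are hypothetical illustrations, not IBAL's `al_obj_t` hierarchy; the flag simply simulates a child that forgets to release its parent reference when it is destroyed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical parent/child model: each live child holds one reference on
 * its parent, so a parent can only finish destruction after every child
 * has dropped its reference. */
typedef struct node {
    struct node *parent;
    int ref_cnt;           /* creator reference + one per live child */
    bool destroyed;
    bool leaks_parent_ref; /* models the bug: child never derefs its parent */
} node_t;

static void node_init(node_t *n, node_t *parent, bool leaks)
{
    n->parent = parent;
    n->ref_cnt = 1; /* the creator's reference */
    n->destroyed = false;
    n->leaks_parent_ref = leaks;
    if (parent)
        parent->ref_cnt++; /* child pins its parent */
}

/* Drops one reference. When the count reaches zero the node is destroyed and,
 * unless it is buggy, releases the reference it holds on its parent, which
 * continues the cascade upward. */
static void node_deref(node_t *n)
{
    if (--n->ref_cnt == 0) {
        n->destroyed = true;
        if (n->parent && !n->leaks_parent_ref) /* a buggy child skips this: the leak */
            node_deref(n->parent);
    }
}
```

Building a chain `ca -> pd -> qp` where `qp` leaks, then dropping the three creator references, destroys `qp` but leaves `pd` and `ca` stuck at `ref_cnt == 1`. A waiter on `ca` would block forever, which has the same shape as IBBUS hanging on HCA disable.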
Index: V:/svn/winib/trunk/core/al/al_common.c
===================================================================
--- V:/svn/winib/trunk/core/al/al_common.c (revision 4403)
+++ V:/svn/winib/trunk/core/al/al_common.c (revision 4404)
@@ -35,6 +35,7 @@
#include "al_ci_ca.h"
#include "al_common.h"
#include "al_debug.h"
+#include "al_ca.h"
#if defined(EVENT_TRACING)
#ifdef offsetof
@@ -46,6 +47,7 @@
#include "al_mgr.h"
#include <complib/cl_math.h>
#include "ib_common.h"
+#include "bus_ev_log.h"
@@ -498,7 +500,6 @@
if( deref_al_obj( p_obj ) )
{
- #ifdef _DEBUG_
uint32_t wait_us;
/*
* Wait for all other references to go away. We wait as long as the
@@ -529,13 +530,11 @@
&p_obj->event, AL_MAX_TIMEOUT_US, AL_WAIT_ALERTABLE );
} while( cl_status == CL_NOT_DONE );
}
- #else
- do
- {
- cl_status = cl_event_wait_on(
- &p_obj->event, EVENT_NO_TIMEOUT, AL_WAIT_ALERTABLE );
- } while( cl_status == CL_NOT_DONE );
- #endif
+ if ( p_obj->p_ci_ca && p_obj->p_ci_ca->h_ca )
+ CL_PRINT_TO_EVENT_LOG( p_obj->p_ci_ca->h_ca->p_fdo, EVENT_IBBUS_ANY_ERROR,
+ ("IBAL stuck: AL object %s, ref_cnt: %d. Forcing object destruction.\n",
+ ib_get_obj_type( p_obj ), p_obj->ref_cnt));
+
+
CL_ASSERT( cl_status == CL_SUCCESS );
if( cl_status != CL_SUCCESS )
{