[ofw] crash on IBBUS disabling while mad traffic
Leonid Keller
leonid at mellanox.co.il
Thu May 21 03:06:27 PDT 2009
Hi Stan,
Thanks a lot on the debug info.
I believe, i understood the problem.
The algorithm of sweeping passes several times through the point, where
I added the increment of ref_cnt (and not once, as i thought).
Find below and attached a new patch, that solves the original problem a
bit simpler.
It takes advance of the fact, that all the sweep processing is managed
by one ioc_sweep_results_t structrure.
The patch increments ref_cnt only once, just after allocating a
structure, and decrements it only once, just before releasing the
structure.
It should work unless this structure get leaked in some scenario.
I'd very thankful to you, if you could check it. (i still don't have SRP
environment).
Please, pay attention, that it is a patch for the original (commited to
SVN) patch and not for one, i sent you in the previous e-mail.
TIA
Index: core/al/kernel/al_ioc_pnp.c
===================================================================
--- core/al/kernel/al_ioc_pnp.c (revision 2191)
+++ core/al/kernel/al_ioc_pnp.c (working copy)
@@ -1720,6 +1720,11 @@
p_results->p_svc = p_svc;
cl_fmap_init( &p_results->iou_map, __iou_cmp );
+ /* Reference the service till the end of sweep processing */
+ ref_al_obj( &p_results->p_svc->obj );
+ cl_dbg_out ("~%d:[IBBUS] %s() : p_results %p, p_svc %p, ref_cnt %d",
+ KeGetCurrentProcessorNumber(), __FUNCTION__, p_results, p_svc,
p_results->p_svc->obj.ref_cnt);
+
/* Build the map of nodes by port GUID. */
__process_nodes( p_svc, &port_map );
@@ -1749,6 +1754,10 @@
break;
default:
CL_ASSERT( p_results );
+ /* Release the reference taken for the sweep. */
+ deref_al_obj( &p_results->p_svc->obj );
+ cl_dbg_out ("~%d:[IBBUS] %s() : p_results %p, p_svc %p, ref_cnt %d",
+ KeGetCurrentProcessorNumber(), __FUNCTION__, p_results, p_svc,
p_results->p_svc->obj.ref_cnt);
cl_free( p_results );
/* Fall through */
case IB_INSUFFICIENT_MEMORY:
@@ -2034,8 +2043,6 @@
if( !cl_atomic_dec( &p_results->p_svc->query_cnt ) &&
status == IB_SUCCESS )
{
- /* Reference the service till the end of processing in the thread
*/
- ref_al_obj( &p_results->p_svc->obj );
cl_async_proc_queue( gp_async_pnp_mgr,
&p_results->async_item );
}
@@ -2231,11 +2238,8 @@
* If this is the last MAD, finish processing the IOU queries
* in the PnP thread.
*/
- if( !cl_atomic_dec( &p_results->p_svc->query_cnt ) ) {
- /* Reference the service till the end of processing in the thread */
- ref_al_obj( &p_results->p_svc->obj );
+ if( !cl_atomic_dec( &p_results->p_svc->query_cnt ) )
cl_async_proc_queue( gp_async_pnp_mgr, &p_results->async_item );
- }
AL_EXIT( AL_DBG_PNP );
}
@@ -2356,8 +2360,10 @@
err:
if( !cl_atomic_dec( &gp_ioc_pnp->query_cnt ) )
cl_async_proc_queue( gp_async_pnp_mgr, &gp_ioc_pnp->async_item );
- /* Release the reference taken for the query. */
+ /* Release the reference taken for the sweep. */
deref_al_obj( &p_results->p_svc->obj );
+ cl_dbg_out ("~%d:[IBBUS] %s() : p_results %p, p_svc %p, ref_cnt %d",
+ KeGetCurrentProcessorNumber(), __FUNCTION__, p_results, p_svc,
p_results->p_svc->obj.ref_cnt);
cl_free( p_results );
}
________________________________
From: Smith, Stan [mailto:stan.smith at intel.com]
Sent: Wednesday, May 20, 2009 10:15 PM
To: Leonid Keller; Fab Tillier
Cc: ofw at lists.openfabrics.org
Subject: RE: [ofw] crash on IBBUS disabling while mad traffic
Hello Leo,
Here's the requested debug info, one HCA enable/disable cycle
with No IOUs available and one cycle with 2 IOUs available.
stan.
No IOUs available - No HCAs enabled
Enable HCA0
Before polling.
~0:[IBBUS] __ioc_pnp_send_cb() : p_results 82136008, p_svc
825F47E8, ref_cnt 3
~1:[IBBUS] __process_sweep() : p_results 82136008, p_svc
825F47E8, ref_cnt 2
~0:[IBBUS] __ioc_pnp_send_cb() : p_results 82145840, p_svc
825F47E8, ref_cnt 3
~1:[IBBUS] __process_sweep() : p_results 82145840, p_svc
825F47E8, ref_cnt 2
~0:[IBBUS] __ioc_pnp_send_cb() : p_results 82145840, p_svc
825F47E8, ref_cnt 3
~1:[IBBUS] __process_sweep() : p_results 82145840, p_svc
825F47E8, ref_cnt 2
Disable HCA0
[AL]:al_cleanup(): Goodbye Cruel World =(
Signaled to stop polling.
Polling thread terminated.
<Enabled SRP Linux target[2 IOUs]>
<restart opensm on the SRP target node>
<reboot test system>
2 IOUs available - No HCAs enabled
Enable HCA0
Before polling.
~0:[IBBUS] __ioc_pnp_send_cb() : p_results 820F3280, p_svc
820360B8, ref_cnt 3
~0:[IBBUS] __ioc_pnp_send_cb() : p_results 820F3280, p_svc
820360B8, ref_cnt 4
~0:[IBBUS] __ioc_pnp_send_cb() : p_results 820F3280, p_svc
820360B8, ref_cnt 5
~1:[IBBUS] __process_sweep() : p_results 820F3280, p_svc
820360B8, ref_cnt 4
Cancelled Found New Hardware Wizard: SRP driver load - same hang
results if SRP driver pre-installed.
~0:[IBBUS] __ioc_pnp_send_cb() : p_results 820F3280, p_svc
820360B8, ref_cnt 6
~0:[IBBUS] __ioc_pnp_send_cb() : p_results 820F3280, p_svc
820360B8, ref_cnt 7
~1:[IBBUS] __process_sweep() : p_results 820F3280, p_svc
820360B8, ref_cnt 6
Disable HCA0
[AL]sync_destroy_obj() !ERROR!: Error waiting for references to
be released - delaying.
[AL]print_al_obj() !ERROR!: AL object
0000000082096008(AL_OBJ_TYPE_AL_MGR), parent: 0000000000000000 ref_cnt:
3
*** Assertion failed: cl_status == CL_SUCCESS
*** Source File:
f:\openib-windows-svn\latest\gen1\trunk\core\al\al_common.c, line 539
Break repeatedly, break Once, Ignore, terminate Process, or
terminate Thread (boipt)? i
i
[AL]sync_destroy_obj() !ERROR!: Forcing object destruction.
[AL]print_al_obj() !ERROR!: AL object
0000000082096008(AL_OBJ_TYPE_AL_MGR), parent: 0000000000000000
ref_cnt: 3
[AL]print_al_obj() !ERROR!: AL object
00000000822103c0(AL_OBJ_TYPE_IOC_PNP_MGR), parent: 0000000082096008
ref_cnt: 1
[AL]print_al_obj() !ERROR!: AL object
00000000820360b8(AL_OBJ_TYPE_IOC_PNP_SVC), parent: 00000000822103c0
ref_cnt: 4
[AL]print_al_obj() !ERROR!: AL object
00000000822103c0(AL_OBJ_TYPE_IOC_PNP_MGR), parent: 0000000082096008
ref_cnt: 1
[AL]print_al_obj() !ERROR!: AL object
00000000820360b8(AL_OBJ_TYPE_IOC_PNP_SVC), parent: 00000000822103c0
ref_cnt: 4
[AL]:al_cleanup(): Goodbye Cruel World =(
Signaled to stop polling.
Polling thread terminated.
________________________________
From: Leonid Keller [mailto:leonid at mellanox.co.il]
Sent: Tuesday, May 19, 2009 6:42 AM
To: Smith, Stan; Fab Tillier
Cc: ofw at lists.openfabrics.org
Subject: RE: [ofw] crash on IBBUS disabling while mad traffic
Hi Stan,
Thank you for the info.
Unfortunately, I don't have now a setup with IOU devices and
can't investigate it.
Maybe you will have a possibility to do a check for me.
To remind: my patch was very simple (only 3 lines):
i increment ref_cnt of the sweeping thread before running it
and decrement it at the end of the sweep handling.
Your data show, that this ref_cnt is not zero.
So either it can be incremented twice in a row or the thread can
exit without decrementing ref_cnt.
I don't see how it can happen.
I'd like you to apply the below patch, make two runs - without
and with IOUs - and send me the debug output of both.
TIA
Index: core/al/kernel/al_ioc_pnp.c
===================================================================
--- core/al/kernel/al_ioc_pnp.c (revision 2162)
+++ core/al/kernel/al_ioc_pnp.c (working copy)
@@ -2036,6 +2036,8 @@
{
/* Reference the service till the end of processing in the
thread */
ref_al_obj( &p_results->p_svc->obj );
+ cl_dbg_out ("~%d:[IBBUS] %s() : p_results %p, p_svc %p,
ref_cnt %d",
+ KeGetCurrentProcessorNumber(), __FUNCTION__, p_results,
p_svc, p_results->p_svc->obj.ref_cnt);
cl_async_proc_queue( gp_async_pnp_mgr,
&p_results->async_item );
}
@@ -2234,6 +2236,8 @@
if( !cl_atomic_dec( &p_results->p_svc->query_cnt ) ) {
/* Reference the service till the end of processing in the
thread */
ref_al_obj( &p_results->p_svc->obj );
+ cl_dbg_out ("~%d:[IBBUS] %s() : p_results %p, p_svc %p,
ref_cnt %d",
+ KeGetCurrentProcessorNumber(), __FUNCTION__, p_results,
p_svc, p_results->p_svc->obj.ref_cnt);
cl_async_proc_queue( gp_async_pnp_mgr, &p_results->async_item
);
}
@@ -2358,6 +2362,8 @@
cl_async_proc_queue( gp_async_pnp_mgr,
&gp_ioc_pnp->async_item );
/* Release the reference taken for the query. */
deref_al_obj( &p_results->p_svc->obj );
+ cl_dbg_out ("~%d:[IBBUS] %s() : p_results %p, p_svc %p,
ref_cnt %d",
+ KeGetCurrentProcessorNumber(), __FUNCTION__, p_results,
p_svc, p_results->p_svc->obj.ref_cnt);
cl_free( p_results );
}
________________________________
From: Smith, Stan [mailto:stan.smith at intel.com]
Sent: Monday, May 18, 2009 11:33 PM
To: Leonid Keller; Fab Tillier
Cc: ofw at lists.openfabrics.org
Subject: RE: [ofw] crash on IBBUS disabling while mad
traffic
Leo,
This patch, which I believe was committed as svn.4275,
works fine if there are no IOUnits in the fabric. Once there is an IOU
present, my case a Linux SRP target, this patch hangs HCA disable for a
debug version of ibbus.
[AL]bus_release_resources(): Releasing BusFilter bfi-0
[AL]:al_cleanup(): Destroying \ device.
[AL]:al_cleanup(): Destroying AL Mgr.
[AL]sync_destroy_obj() !ERROR!: Error waiting for
references to be released - delaying.
[AL]print_al_obj() !ERROR!: AL object
0000000082156200(AL_OBJ_TYPE_AL_MGR), parent: 0000000000000000 ref_cnt:
3
*** Assertion failed: cl_status == CL_SUCCESS
*** Source File:
f:\openib-windows-svn\latest\gen1\trunk\core\al\al_common.c, line 554
Break repeatedly, break Once, Ignore, terminate Process,
or terminate Thread (boipt)? i
i
[AL]sync_destroy_obj() !ERROR!: Forcing object
destruction.
[AL]print_al_obj() !ERROR!: AL object
0000000082156200(AL_OBJ_TYPE_AL_MGR), parent: 0000000000000000 ref_cnt:
3
[AL]print_al_obj() !ERROR!: AL object
0000000082175270(AL_OBJ_TYPE_IOC_PNP_MGR), parent: 0000000082156200
ref_cnt: 1
[AL]print_al_obj() !ERROR!: AL object
00000000ff8ca2c0(AL_OBJ_TYPE_IOC_PNP_SVC), parent: 0000000082175270
ref_cnt: 2
[AL]print_al_obj() !ERROR!: AL object
0000000082175270(AL_OBJ_TYPE_IOC_PNP_MGR), parent: 0000000082156200
ref_cnt: 1
[AL]print_al_obj() !ERROR!: AL object
00000000ff8ca2c0(AL_OBJ_TYPE_IOC_PNP_SVC), parent: 0000000082175270
ref_cnt: 2
[AL]:al_cleanup(): Destroying async obj mgr.
[AL]:al_cleanup(): Destroying async pnp mgr.
[AL]:al_cleanup(): Destroying async proc mgr.
[AL]:al_cleanup(): Goodbye Cruel World =(
[AL]bus_release_resources() ]
Signaled to stop polling.
Polling thread terminated.
It seems there is a path in IBAL which is not releasing
the reference on the IOC PnP service when an IOU is present in the
fabric.
Perhaps you could suggest a fix?
If commit svn.4275 is removed the call to al_cleanup()
returns successfully with no errors.
thanks,
Stan.
________________________________
From: Leonid Keller [mailto:leonid at mellanox.co.il]
Sent: Monday, April 27, 2009 5:38 AM
To: Leonid Keller; Fab Tillier; Smith, Stan
Cc: ofw at lists.openfabrics.org
Subject: RE: [ofw] crash on IBBUS disabling while mad
traffic
Here is a possible explanation and a fix. Please,
review.
__ioc_query_sa takes references on IOC PnP service
before sending the node and path_record requests.
But these references get released at the end of
__node_rec_cb and __path_rec_cb, while __process_sweep routine, which
performs the IOU sweeping, is just scheduled to run in an async thread.
If the test happens to unload the driver after
__node_rec_cb and __path_rec_cb and before __process_sweep started to
run, IOC PnP service gets released and __process_sweep crashes.
The patch takes a reference on IOC PnP service before
scheduling a thread for __process_sweep and releases the reference at
the end of __process_sweep.
(Pay attention, that __process_sweep schedules a thread
for itself twice while moving through its FSM:
SWEEP_IOU_INFO --> SWEEP_IOC_PROFILE -->
SWEEP_SVC_ENTRIES --> SWEEP_COMPLETE)
Index: al/kernel/al_ioc_pnp.c
===================================================================
--- al/kernel/al_ioc_pnp.c (revision 3609)
+++ al/kernel/al_ioc_pnp.c (working copy)
@@ -2231,8 +2231,11 @@
* If this is the last MAD, finish processing the IOU
queries
* in the PnP thread.
*/
- if( !cl_atomic_dec( &p_results->p_svc->query_cnt ) )
+ if( !cl_atomic_dec( &p_results->p_svc->query_cnt ) ) {
+ /* Reference the service till the end of processing
in the thread */
+ ref_al_obj( &p_results->p_svc->obj );
cl_async_proc_queue( gp_async_pnp_mgr,
&p_results->async_item );
+ }
AL_EXIT( AL_DBG_PNP );
}
@@ -2354,6 +2357,8 @@
if( !cl_atomic_dec( &gp_ioc_pnp->query_cnt ) )
cl_async_proc_queue( gp_async_pnp_mgr,
&gp_ioc_pnp->async_item );
cl_free( p_results );
+ /* Release the reference taken for the query. */
+ deref_al_obj( &p_results->p_svc->obj );
}
AL_EXIT( AL_DBG_PNP );
________________________________
From: Leonid Keller
Sent: Sunday, April 26, 2009 1:05 AM
To: 'Fab Tillier'; 'Smith, Stan'
Cc: ofw at lists.openfabrics.org
Subject: [ofw] crash on IBBUS disabling while
mad traffic
I've got a crash while running WHQL Disable
Enable test while opensm was running on another node.
I was running a December version of the driver,
but i'm not sure this will work with current one. (i'll try)
The test, which makes disable/enable to all
devices, passes without opensm.
With opensm IBBUS sends SA requests to opensm.
In this case __process_sweep() fails, because
per-port IOC PnP agent seems to be already released.
The latter is strange, because __ioc_query_sa
takes reference on PnP agent before sending request.
__ioc_query_sa
__node_rec_cb
__process_query
__process_sweep
Any ideas ?
3: kd> !analyze -v
ERROR: FindPlugIns 8007007b
************************************************************************
*******
*
*
* Bugcheck Analysis
*
*
*
************************************************************************
*******
DRIVER_PAGE_FAULT_IN_FREED_SPECIAL_POOL (d5)
Memory was referenced after it was freed.
This cannot be protected by try-except.
When possible, the guilty driver's name (Unicode
string) is printed on
the bugcheck screen and saved in
KiBugCheckDriver.
Arguments:
Arg1: fffff98005b72f84, memory referenced
Arg2: 0000000000000000, value 0 = read
operation, 1 = write operation
Arg3: fffffa600400b1d0, if non-zero, the address
which referenced memory.
Arg4: 0000000000000000, (reserved)
Debugging Details:
------------------
Matched: ibbus!proxy_ioctl+0x41
(fffffa60`04031d8d)
Matched: ibbus!proxy_ioctl+0xa5
(fffffa60`04031df1)
READ_ADDRESS: fffff98005b72f84 Special pool
FAULTING_IP:
ibbus!__process_sweep+44
[s:\builds\3609\branches\mlnx_winof_2-0\core\al\kernel\al_ioc_pnp.c @
2315]
fffffa60`0400b1d0 83b8d400000003 cmp dword
ptr [rax+0D4h],3
MM_INTERNAL_CODE: 0
IMAGE_NAME: ibbus.sys
DEBUG_FLR_IMAGE_TIMESTAMP: 49401b3e
MODULE_NAME: ibbus
FAULTING_MODULE: fffffa6004002000 ibbus
DEFAULT_BUCKET_ID: VISTA_DRIVER_FAULT
BUGCHECK_STR: 0xD5
PROCESS_NAME: System
CURRENT_IRQL: f
TRAP_FRAME: fffffa6003d50b00 -- (.trap
0xfffffa6003d50b00)
NOTE: The trap frame does not contain all
registers.
Some register values may be zeroed or incorrect.
rax=fffff98005b72eb0 rbx=0000000000000000
rcx=fffffa6004057780
rdx=fffffa6004005e97 rsi=fffffa600199ccc0
rdi=fffff80001cc0304
rip=fffffa600400b1d0 rsp=fffffa6003d50c90
rbp=0000000000000080
r8=0000000000000005 r9=fffffa6004005e97
r10=0000000000000001
r11=fffffa6003d50c50 r12=0000000000000000
r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0 nv up ei pl zr na po nc
ibbus!__process_sweep+0x44:
fffffa60`0400b1d0 83b8d400000003 cmp dword
ptr [rax+0D4h],3 ds:fffff980`05b72f84=????????
Resetting default scope
LAST_CONTROL_TRANSFER: from fffff80001969c42 to
fffff800018b0b30
STACK_TEXT:
fffffa60`03d502f8 fffff800`01969c42 :
fffffa80`0e0eb290 fffff800`0194893d fffff800`01a55140 00000000`00001000
: nt!RtlpBreakWithStatusInstruction
fffffa60`03d50300 fffff800`0196adb7 :
fffff800`00000004 fffff800`01a55140 ffffffff`fffff000 00000000`00000050
: nt!KiBugCheckDebugBreak+0x12
fffffa60`03d50360 fffff800`018b6754 :
fffffa80`0dd77480 fffff800`01cc2bb9 00000000`00000000 fffff800`0194c13f
: nt!KeBugCheck2+0xaa7
fffffa60`03d509d0 fffff800`018c5671 :
00000000`00000050 fffff980`05b72f84 00000000`00000000 fffffa60`03d50b00
: nt!KeBugCheckEx+0x104
fffffa60`03d50a10 fffff800`018b51d9 :
00000000`00000000 fffff980`0427cf78 fffffa80`0e0ecf00 fffff980`1c27ef40
: nt!MmAccessFault+0x1371
fffffa60`03d50b00 fffffa60`0400b1d0 :
fffff980`1c27ef40 fffff980`04318e00 fffffa60`04005eba fffff980`04318e78
: nt!KiPageFault+0x119
fffffa60`03d50c90 fffffa60`04005e9d :
fffff980`04318e98 fffff980`043bccb0 fffff980`1b88afd0 fffff980`04318e78
: ibbus!__process_sweep+0x44
[s:\builds\3609\branches\mlnx_winof_2-0\core\al\kernel\al_ioc_pnp.c @
2315]
fffffa60`03d50cc0 fffffa60`040070d9 :
fffff980`04318d60 fffff980`0434afd0 00000000`00000000 fffffa60`0400743c
: ibbus!__cl_async_proc_worker+0x61
[s:\builds\3609\branches\mlnx_winof_2-0\core\complib\cl_async_proc.c @
153]
fffffa60`03d50cf0 fffffa60`04007464 :
fffff980`0434afd0 00000000`00000080 fffff980`0434afd0 8b8b8b8b`8b8b8b8b
: ibbus!__cl_thread_pool_routine+0x41
[s:\builds\3609\branches\mlnx_winof_2-0\core\complib\cl_threadpool.c @
66]
fffffa60`03d50d20 fffff800`01adafd3 :
8b8b8b8b`8b8b8b8b 8b8b8b8b`8b8b8b8b 8b8b8b8b`8b8b8b8b 8b8b8b8b`8b8b8b01
: ibbus!__thread_callback+0x28
[s:\builds\3609\branches\mlnx_winof_2-0\core\complib\kernel\cl_thread.c
@ 49]
fffffa60`03d50d50 fffff800`018f0816 :
fffffa60`01999180 fffffa80`0e0eb290 fffffa60`019a2d40 00000000`00000001
: nt!PspSystemThreadStartup+0x57
fffffa60`03d50d80 00000000`00000000 :
00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000
: nt!KiStartSystemThread+0x16
STACK_COMMAND: kb
FOLLOWUP_IP:
ibbus!__process_sweep+44
[s:\builds\3609\branches\mlnx_winof_2-0\core\al\kernel\al_ioc_pnp.c @
2315]
fffffa60`0400b1d0 83b8d400000003 cmp dword
ptr [rax+0D4h],3
FAULTING_SOURCE_CODE:
2311:
2312: p_results = PARENT_STRUCT(
p_async_item, ioc_sweep_results_t, async_item );
2313: CL_ASSERT( !p_results->p_svc->query_cnt
);
2314:
> 2315: if( p_results->p_svc->obj.state ==
CL_DESTROYING )
2316: {
2317: __put_iou_map( gp_ioc_pnp,
&p_results->iou_map );
2318: goto err;
2319: }
2320:
SYMBOL_STACK_INDEX: 6
SYMBOL_NAME: ibbus!__process_sweep+44
FOLLOWUP_NAME: MachineOwner
FAILURE_BUCKET_ID:
X64_0xD5_VRF_ibbus!__process_sweep+44
BUCKET_ID:
X64_0xD5_VRF_ibbus!__process_sweep+44
Followup: MachineOwner
---------
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/ofw/attachments/20090521/b794fb7d/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sweep2.patch
Type: application/octet-stream
Size: 2403 bytes
Desc: sweep2.patch
URL: <http://lists.openfabrics.org/pipermail/ofw/attachments/20090521/b794fb7d/attachment.obj>
More information about the ofw
mailing list