[ofw] crash on IBBUS disabling while mad traffic

Smith, Stan stan.smith at intel.com
Thu May 21 10:55:48 PDT 2009


This patch works as expected!
When SRP target is present disable/enable cycles work correctly.
When SRP target is not present disable/enable cycles work correctly and pass the WDT enable_disable test.

Please commit ASAP.

stan.

________________________________
From: Leonid Keller [mailto:leonid at mellanox.co.il]
Sent: Thursday, May 21, 2009 3:06 AM
To: Smith, Stan; Fab Tillier
Cc: ofw at lists.openfabrics.org
Subject: RE: [ofw] crash on IBBUS disabling while mad traffic

Hi Stan,

Thanks a lot on the debug info.
I believe, i understood the problem.
The algorithm of sweeping passes several times through the point, where I added the increment of ref_cnt (and not once, as i thought).

Find below and attached a new patch, that solves the original problem a bit simpler.
It takes advance of the fact, that all the sweep processing is managed by one ioc_sweep_results_t structrure.
The patch increments ref_cnt only once, just after allocating a structure, and decrements it only once, just before releasing the structure.
It should work unless this structure get leaked in some scenario.

I'd very thankful to you, if you could check it. (i still don't have SRP environment).
Please, pay attention, that it is a patch for the original (commited to SVN) patch and not for one, i sent you in the previous e-mail.
TIA

Index: core/al/kernel/al_ioc_pnp.c
===================================================================
--- core/al/kernel/al_ioc_pnp.c (revision 2191)
+++ core/al/kernel/al_ioc_pnp.c (working copy)
@@ -1720,6 +1720,11 @@
   p_results->p_svc = p_svc;
   cl_fmap_init( &p_results->iou_map, __iou_cmp );

+  /* Reference the service till the end of sweep processing */
+  ref_al_obj( &p_results->p_svc->obj );
+  cl_dbg_out ("~%d:[IBBUS] %s() : p_results %p, p_svc %p, ref_cnt %d",
+   KeGetCurrentProcessorNumber(), __FUNCTION__, p_results, p_svc, p_results->p_svc->obj.ref_cnt);
+
   /* Build the map of nodes by port GUID. */
   __process_nodes( p_svc, &port_map );

@@ -1749,6 +1754,10 @@
   break;
  default:
   CL_ASSERT( p_results );
+  /* Release the reference taken for the sweep. */
+  deref_al_obj( &p_results->p_svc->obj );
+  cl_dbg_out ("~%d:[IBBUS] %s() : p_results %p, p_svc %p, ref_cnt %d",
+   KeGetCurrentProcessorNumber(), __FUNCTION__, p_results, p_svc, p_results->p_svc->obj.ref_cnt);
   cl_free( p_results );
   /* Fall through */
  case IB_INSUFFICIENT_MEMORY:
@@ -2034,8 +2043,6 @@
    if( !cl_atomic_dec( &p_results->p_svc->query_cnt ) &&
     status == IB_SUCCESS )
    {
-    /* Reference the service till the end of processing in the thread */
-    ref_al_obj( &p_results->p_svc->obj );
     cl_async_proc_queue( gp_async_pnp_mgr,
      &p_results->async_item );
    }
@@ -2231,11 +2238,8 @@
   * If this is the last MAD, finish processing the IOU queries
   * in the PnP thread.
   */
- if( !cl_atomic_dec( &p_results->p_svc->query_cnt ) ) {
-  /* Reference the service till the end of processing in the thread */
-  ref_al_obj( &p_results->p_svc->obj );
+ if( !cl_atomic_dec( &p_results->p_svc->query_cnt ) )
   cl_async_proc_queue( gp_async_pnp_mgr, &p_results->async_item );
- }

  AL_EXIT( AL_DBG_PNP );
 }
@@ -2356,8 +2360,10 @@
 err:
   if( !cl_atomic_dec( &gp_ioc_pnp->query_cnt ) )
    cl_async_proc_queue( gp_async_pnp_mgr, &gp_ioc_pnp->async_item );
-  /* Release the reference taken for the query. */
+  /* Release the reference taken for the sweep. */
   deref_al_obj( &p_results->p_svc->obj );
+  cl_dbg_out ("~%d:[IBBUS] %s() : p_results %p, p_svc %p, ref_cnt %d",
+   KeGetCurrentProcessorNumber(), __FUNCTION__, p_results, p_svc, p_results->p_svc->obj.ref_cnt);
   cl_free( p_results );
  }


________________________________
From: Smith, Stan [mailto:stan.smith at intel.com]
Sent: Wednesday, May 20, 2009 10:15 PM
To: Leonid Keller; Fab Tillier
Cc: ofw at lists.openfabrics.org
Subject: RE: [ofw] crash on IBBUS disabling while mad traffic

Hello Leo,
  Here's the requested debug info, one HCA enable/disable cycle with No IOUs available and one cycle with 2 IOUs available.

stan.


No IOUs available -  No HCAs enabled
Enable HCA0
Before polling.
~0:[IBBUS]  __ioc_pnp_send_cb() : p_results 82136008, p_svc 825F47E8, ref_cnt 3
~1:[IBBUS]  __process_sweep()    : p_results 82136008, p_svc 825F47E8, ref_cnt 2
~0:[IBBUS]  __ioc_pnp_send_cb() : p_results 82145840, p_svc 825F47E8, ref_cnt 3
~1:[IBBUS]  __process_sweep()    : p_results 82145840, p_svc 825F47E8, ref_cnt 2
~0:[IBBUS]  __ioc_pnp_send_cb() : p_results 82145840, p_svc 825F47E8, ref_cnt 3
~1:[IBBUS]  __process_sweep()    : p_results 82145840, p_svc 825F47E8, ref_cnt 2

Disable HCA0
[AL]:al_cleanup(): Goodbye Cruel World =(
Signaled to stop polling.
Polling thread terminated.

<Enabled SRP Linux target[2 IOUs]>
<restart opensm on the SRP target node>

<reboot test system>

2 IOUs available - No HCAs enabled
Enable HCA0
Before polling.
~0:[IBBUS]  __ioc_pnp_send_cb() : p_results 820F3280, p_svc 820360B8, ref_cnt 3
~0:[IBBUS]  __ioc_pnp_send_cb() : p_results 820F3280, p_svc 820360B8, ref_cnt 4
~0:[IBBUS]  __ioc_pnp_send_cb() : p_results 820F3280, p_svc 820360B8, ref_cnt 5
~1:[IBBUS]  __process_sweep()    : p_results 820F3280, p_svc 820360B8,   ref_cnt 4
Cancelled Found New Hardware Wizard: SRP driver load - same hang results if SRP driver pre-installed.
~0:[IBBUS]  __ioc_pnp_send_cb() : p_results 820F3280, p_svc 820360B8, ref_cnt 6
~0:[IBBUS]  __ioc_pnp_send_cb() : p_results 820F3280, p_svc 820360B8, ref_cnt 7
~1:[IBBUS]  __process_sweep()    : p_results 820F3280, p_svc 820360B8,   ref_cnt 6

Disable HCA0
[AL]sync_destroy_obj() !ERROR!: Error waiting for references to be released - delaying.
[AL]print_al_obj() !ERROR!: AL object 0000000082096008(AL_OBJ_TYPE_AL_MGR), parent: 0000000000000000 ref_cnt: 3

*** Assertion failed: cl_status == CL_SUCCESS
***   Source File: f:\openib-windows-svn\latest\gen1\trunk\core\al\al_common.c, line 539

Break repeatedly, break Once, Ignore, terminate Process, or terminate Thread (boipt)? i
i
[AL]sync_destroy_obj() !ERROR!: Forcing object destruction.
[AL]print_al_obj() !ERROR!: AL object 0000000082096008(AL_OBJ_TYPE_AL_MGR),      parent: 0000000000000000 ref_cnt: 3
[AL]print_al_obj() !ERROR!: AL object 00000000822103c0(AL_OBJ_TYPE_IOC_PNP_MGR), parent: 0000000082096008 ref_cnt: 1
[AL]print_al_obj() !ERROR!: AL object 00000000820360b8(AL_OBJ_TYPE_IOC_PNP_SVC), parent: 00000000822103c0 ref_cnt: 4
[AL]print_al_obj() !ERROR!: AL object 00000000822103c0(AL_OBJ_TYPE_IOC_PNP_MGR), parent: 0000000082096008 ref_cnt: 1
[AL]print_al_obj() !ERROR!: AL object 00000000820360b8(AL_OBJ_TYPE_IOC_PNP_SVC), parent: 00000000822103c0 ref_cnt: 4
[AL]:al_cleanup(): Goodbye Cruel World =(
Signaled to stop polling.
Polling thread terminated.


________________________________
From: Leonid Keller [mailto:leonid at mellanox.co.il]
Sent: Tuesday, May 19, 2009 6:42 AM
To: Smith, Stan; Fab Tillier
Cc: ofw at lists.openfabrics.org
Subject: RE: [ofw] crash on IBBUS disabling while mad traffic

Hi Stan,

Thank you for the info.
Unfortunately, I don't have now a setup with IOU devices and can't investigate it.
Maybe you will have a possibility to do a check for me.

To remind: my patch was very simple (only 3 lines):
    i increment ref_cnt of the sweeping thread before running it and decrement it at the end of the sweep handling.
Your data show, that this ref_cnt is not zero.
So either it can be incremented twice in a row or the thread can exit without decrementing ref_cnt.
I don't see how it can happen.

I'd like you to apply the below patch, make two runs - without and with IOUs - and send me the debug output of both.
TIA

Index: core/al/kernel/al_ioc_pnp.c
===================================================================
--- core/al/kernel/al_ioc_pnp.c (revision 2162)
+++ core/al/kernel/al_ioc_pnp.c (working copy)
@@ -2036,6 +2036,8 @@
    {
     /* Reference the service till the end of processing in the thread */
     ref_al_obj( &p_results->p_svc->obj );
+    cl_dbg_out ("~%d:[IBBUS] %s() : p_results %p, p_svc %p, ref_cnt %d",
+     KeGetCurrentProcessorNumber(), __FUNCTION__, p_results, p_svc, p_results->p_svc->obj.ref_cnt);
     cl_async_proc_queue( gp_async_pnp_mgr,
      &p_results->async_item );
    }
@@ -2234,6 +2236,8 @@
  if( !cl_atomic_dec( &p_results->p_svc->query_cnt ) ) {
   /* Reference the service till the end of processing in the thread */
   ref_al_obj( &p_results->p_svc->obj );
+  cl_dbg_out ("~%d:[IBBUS] %s() : p_results %p, p_svc %p, ref_cnt %d",
+   KeGetCurrentProcessorNumber(), __FUNCTION__, p_results, p_svc, p_results->p_svc->obj.ref_cnt);
   cl_async_proc_queue( gp_async_pnp_mgr, &p_results->async_item );
  }

@@ -2358,6 +2362,8 @@
    cl_async_proc_queue( gp_async_pnp_mgr, &gp_ioc_pnp->async_item );
   /* Release the reference taken for the query. */
   deref_al_obj( &p_results->p_svc->obj );
+  cl_dbg_out ("~%d:[IBBUS] %s() : p_results %p, p_svc %p, ref_cnt %d",
+   KeGetCurrentProcessorNumber(), __FUNCTION__, p_results, p_svc, p_results->p_svc->obj.ref_cnt);
   cl_free( p_results );
  }




________________________________
From: Smith, Stan [mailto:stan.smith at intel.com]
Sent: Monday, May 18, 2009 11:33 PM
To: Leonid Keller; Fab Tillier
Cc: ofw at lists.openfabrics.org
Subject: RE: [ofw] crash on IBBUS disabling while mad traffic

Leo,
  This patch, which I believe was committed as svn.4275, works fine if there are no IOUnits in the fabric. Once there is an IOU present, my case a Linux SRP target, this patch hangs HCA disable for a debug version of ibbus.

[AL]bus_release_resources(): Releasing BusFilter bfi-0
[AL]:al_cleanup(): Destroying \ device.
[AL]:al_cleanup(): Destroying AL Mgr.
[AL]sync_destroy_obj() !ERROR!: Error waiting for references to be released - delaying.
[AL]print_al_obj() !ERROR!: AL object 0000000082156200(AL_OBJ_TYPE_AL_MGR), parent: 0000000000000000 ref_cnt: 3

*** Assertion failed: cl_status == CL_SUCCESS
***   Source File: f:\openib-windows-svn\latest\gen1\trunk\core\al\al_common.c, line 554

Break repeatedly, break Once, Ignore, terminate Process, or terminate Thread (boipt)? i
i
[AL]sync_destroy_obj() !ERROR!: Forcing object destruction.
[AL]print_al_obj() !ERROR!: AL object 0000000082156200(AL_OBJ_TYPE_AL_MGR), parent: 0000000000000000 ref_cnt: 3
[AL]print_al_obj() !ERROR!: AL object 0000000082175270(AL_OBJ_TYPE_IOC_PNP_MGR), parent: 0000000082156200 ref_cnt: 1
[AL]print_al_obj() !ERROR!: AL object 00000000ff8ca2c0(AL_OBJ_TYPE_IOC_PNP_SVC), parent: 0000000082175270 ref_cnt: 2
[AL]print_al_obj() !ERROR!: AL object 0000000082175270(AL_OBJ_TYPE_IOC_PNP_MGR), parent: 0000000082156200 ref_cnt: 1
[AL]print_al_obj() !ERROR!: AL object 00000000ff8ca2c0(AL_OBJ_TYPE_IOC_PNP_SVC), parent: 0000000082175270 ref_cnt: 2
[AL]:al_cleanup(): Destroying async obj mgr.
[AL]:al_cleanup(): Destroying async pnp mgr.
[AL]:al_cleanup(): Destroying async proc mgr.
[AL]:al_cleanup(): Goodbye Cruel World =(
[AL]bus_release_resources() ]
Signaled to stop polling.
Polling thread terminated.
It seems there is a path in IBAL which is not releasing the reference on the IOC PnP service when an IOU is present in the fabric.
Perhaps you could suggest a fix?
If commit svn.4275 is removed the call to al_cleanup() returns successfully with no errors.

thanks,

Stan.

________________________________
From: Leonid Keller [mailto:leonid at mellanox.co.il]
Sent: Monday, April 27, 2009 5:38 AM
To: Leonid Keller; Fab Tillier; Smith, Stan
Cc: ofw at lists.openfabrics.org
Subject: RE: [ofw] crash on IBBUS disabling while mad traffic

Here is a possible explanation and a fix. Please, review.

__ioc_query_sa takes references on IOC PnP service before sending the node and path_record requests.
But these references get released at the end of __node_rec_cb and __path_rec_cb, while __process_sweep routine, which performs the IOU sweeping, is just scheduled to run in an async thread.
If the test happens to unload the driver after __node_rec_cb and __path_rec_cb and before __process_sweep started to run, IOC PnP service gets released and __process_sweep crashes.

The patch takes a reference on IOC PnP service before scheduling a thread for __process_sweep and releases the reference at the end of __process_sweep.
(Pay attention, that __process_sweep schedules a thread for itself twice while moving through its FSM:
SWEEP_IOU_INFO --> SWEEP_IOC_PROFILE --> SWEEP_SVC_ENTRIES --> SWEEP_COMPLETE)

Index: al/kernel/al_ioc_pnp.c
===================================================================
--- al/kernel/al_ioc_pnp.c (revision 3609)
+++ al/kernel/al_ioc_pnp.c (working copy)
@@ -2231,8 +2231,11 @@
   * If this is the last MAD, finish processing the IOU queries
   * in the PnP thread.
   */
- if( !cl_atomic_dec( &p_results->p_svc->query_cnt ) )
+ if( !cl_atomic_dec( &p_results->p_svc->query_cnt ) ) {
+  /* Reference the service till the end of processing in the thread */
+  ref_al_obj( &p_results->p_svc->obj );
   cl_async_proc_queue( gp_async_pnp_mgr, &p_results->async_item );
+ }

  AL_EXIT( AL_DBG_PNP );
 }
@@ -2354,6 +2357,8 @@
   if( !cl_atomic_dec( &gp_ioc_pnp->query_cnt ) )
    cl_async_proc_queue( gp_async_pnp_mgr, &gp_ioc_pnp->async_item );
   cl_free( p_results );
+  /* Release the reference taken for the query. */
+  deref_al_obj( &p_results->p_svc->obj );
  }

  AL_EXIT( AL_DBG_PNP );


________________________________
From: Leonid Keller
Sent: Sunday, April 26, 2009 1:05 AM
To: 'Fab Tillier'; 'Smith, Stan'
Cc: ofw at lists.openfabrics.org
Subject: [ofw] crash on IBBUS disabling while mad traffic

I've got a crash while running WHQL Disable Enable test while opensm was running on another node.
I was running a December version of the driver, but i'm not sure this will work with current one. (i'll try)

The test, which makes disable/enable to all devices, passes without opensm.
With opensm IBBUS sends SA requests to opensm.
In this case __process_sweep() fails, because per-port IOC PnP agent seems to be already released.
The latter is strange, because __ioc_query_sa takes reference on PnP agent before sending request.
   __ioc_query_sa
    __node_rec_cb
     __process_query
      __process_sweep
Any ideas ?


3: kd> !analyze -v
ERROR: FindPlugIns 8007007b
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

DRIVER_PAGE_FAULT_IN_FREED_SPECIAL_POOL (d5)
Memory was referenced after it was freed.
This cannot be protected by try-except.
When possible, the guilty driver's name (Unicode string) is printed on
the bugcheck screen and saved in KiBugCheckDriver.
Arguments:
Arg1: fffff98005b72f84, memory referenced
Arg2: 0000000000000000, value 0 = read operation, 1 = write operation
Arg3: fffffa600400b1d0, if non-zero, the address which referenced memory.
Arg4: 0000000000000000, (reserved)

Debugging Details:
------------------

Matched: ibbus!proxy_ioctl+0x41 (fffffa60`04031d8d)
Matched: ibbus!proxy_ioctl+0xa5 (fffffa60`04031df1)

READ_ADDRESS:  fffff98005b72f84 Special pool

FAULTING_IP:
ibbus!__process_sweep+44 [s:\builds\3609\branches\mlnx_winof_2-0\core\al\kernel\al_ioc_pnp.c @ 2315]
fffffa60`0400b1d0 83b8d400000003  cmp     dword ptr [rax+0D4h],3

MM_INTERNAL_CODE:  0

IMAGE_NAME:  ibbus.sys

DEBUG_FLR_IMAGE_TIMESTAMP:  49401b3e

MODULE_NAME: ibbus

FAULTING_MODULE: fffffa6004002000 ibbus

DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT

BUGCHECK_STR:  0xD5

PROCESS_NAME:  System

CURRENT_IRQL:  f

TRAP_FRAME:  fffffa6003d50b00 -- (.trap 0xfffffa6003d50b00)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=fffff98005b72eb0 rbx=0000000000000000 rcx=fffffa6004057780
rdx=fffffa6004005e97 rsi=fffffa600199ccc0 rdi=fffff80001cc0304
rip=fffffa600400b1d0 rsp=fffffa6003d50c90 rbp=0000000000000080
 r8=0000000000000005  r9=fffffa6004005e97 r10=0000000000000001
r11=fffffa6003d50c50 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei pl zr na po nc
ibbus!__process_sweep+0x44:
fffffa60`0400b1d0 83b8d400000003  cmp     dword ptr [rax+0D4h],3 ds:fffff980`05b72f84=????????
Resetting default scope

LAST_CONTROL_TRANSFER:  from fffff80001969c42 to fffff800018b0b30

STACK_TEXT:
fffffa60`03d502f8 fffff800`01969c42 : fffffa80`0e0eb290 fffff800`0194893d fffff800`01a55140 00000000`00001000 : nt!RtlpBreakWithStatusInstruction
fffffa60`03d50300 fffff800`0196adb7 : fffff800`00000004 fffff800`01a55140 ffffffff`fffff000 00000000`00000050 : nt!KiBugCheckDebugBreak+0x12
fffffa60`03d50360 fffff800`018b6754 : fffffa80`0dd77480 fffff800`01cc2bb9 00000000`00000000 fffff800`0194c13f : nt!KeBugCheck2+0xaa7
fffffa60`03d509d0 fffff800`018c5671 : 00000000`00000050 fffff980`05b72f84 00000000`00000000 fffffa60`03d50b00 : nt!KeBugCheckEx+0x104
fffffa60`03d50a10 fffff800`018b51d9 : 00000000`00000000 fffff980`0427cf78 fffffa80`0e0ecf00 fffff980`1c27ef40 : nt!MmAccessFault+0x1371
fffffa60`03d50b00 fffffa60`0400b1d0 : fffff980`1c27ef40 fffff980`04318e00 fffffa60`04005eba fffff980`04318e78 : nt!KiPageFault+0x119
fffffa60`03d50c90 fffffa60`04005e9d : fffff980`04318e98 fffff980`043bccb0 fffff980`1b88afd0 fffff980`04318e78 : ibbus!__process_sweep+0x44 [s:\builds\3609\branches\mlnx_winof_2-0\core\al\kernel\al_ioc_pnp.c @ 2315]
fffffa60`03d50cc0 fffffa60`040070d9 : fffff980`04318d60 fffff980`0434afd0 00000000`00000000 fffffa60`0400743c : ibbus!__cl_async_proc_worker+0x61 [s:\builds\3609\branches\mlnx_winof_2-0\core\complib\cl_async_proc.c @ 153]
fffffa60`03d50cf0 fffffa60`04007464 : fffff980`0434afd0 00000000`00000080 fffff980`0434afd0 8b8b8b8b`8b8b8b8b : ibbus!__cl_thread_pool_routine+0x41 [s:\builds\3609\branches\mlnx_winof_2-0\core\complib\cl_threadpool.c @ 66]
fffffa60`03d50d20 fffff800`01adafd3 : 8b8b8b8b`8b8b8b8b 8b8b8b8b`8b8b8b8b 8b8b8b8b`8b8b8b8b 8b8b8b8b`8b8b8b01 : ibbus!__thread_callback+0x28 [s:\builds\3609\branches\mlnx_winof_2-0\core\complib\kernel\cl_thread.c @ 49]
fffffa60`03d50d50 fffff800`018f0816 : fffffa60`01999180 fffffa80`0e0eb290 fffffa60`019a2d40 00000000`00000001 : nt!PspSystemThreadStartup+0x57
fffffa60`03d50d80 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x16


STACK_COMMAND:  kb

FOLLOWUP_IP:
ibbus!__process_sweep+44 [s:\builds\3609\branches\mlnx_winof_2-0\core\al\kernel\al_ioc_pnp.c @ 2315]
fffffa60`0400b1d0 83b8d400000003  cmp     dword ptr [rax+0D4h],3

FAULTING_SOURCE_CODE:
  2311:
  2312:  p_results = PARENT_STRUCT( p_async_item, ioc_sweep_results_t, async_item );
  2313:  CL_ASSERT( !p_results->p_svc->query_cnt );
  2314:
> 2315:  if( p_results->p_svc->obj.state == CL_DESTROYING )
  2316:  {
  2317:   __put_iou_map( gp_ioc_pnp, &p_results->iou_map );
  2318:   goto err;
  2319:  }
  2320:


SYMBOL_STACK_INDEX:  6

SYMBOL_NAME:  ibbus!__process_sweep+44

FOLLOWUP_NAME:  MachineOwner

FAILURE_BUCKET_ID:  X64_0xD5_VRF_ibbus!__process_sweep+44

BUCKET_ID:  X64_0xD5_VRF_ibbus!__process_sweep+44

Followup: MachineOwner
---------


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/ofw/attachments/20090521/4d70ff00/attachment.html>


More information about the ofw mailing list