[ofw] crash on IBBUS disabling while mad traffic

Leonid Keller leonid at mellanox.co.il
Tue May 19 06:41:49 PDT 2009


Hi Stan, 
 
Thank you for the info.
Unfortunately, I don't currently have a setup with IOU devices, so I
can't investigate this myself.
Could you run a check for me?
 
As a reminder, my patch was very simple (only 3 lines):
    I take a reference on the IOC PnP service (incrementing its ref_cnt)
before scheduling the sweep thread and release it at the end of the
sweep handling.
Your data shows that this ref_cnt is not zero.
So either it is incremented twice in a row, or the thread exits without
decrementing it.
I don't see how either can happen.
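
To make the two possibilities concrete, here is a minimal user-space
sketch. It is not the IBAL code: obj_t, obj_ref, obj_deref,
sweep_handler and leftover_refs are hypothetical names made up for
illustration. It only counts what is left on the object when the
reference is taken twice, or when the handler returns early.

```c
#include <assert.h>

/* Hypothetical stand-in for an AL object; only the counter matters here. */
typedef struct { int ref_cnt; } obj_t;

static void obj_ref(obj_t *o)   { o->ref_cnt++; }
static void obj_deref(obj_t *o) { assert(o->ref_cnt > 0); o->ref_cnt--; }

/* Thread side: one sweep pass. If early_exit is nonzero the handler
 * returns without dropping its reference (second hypothesis). */
static void sweep_handler(obj_t *svc, int early_exit)
{
    if (early_exit)
        return;            /* leaked reference */
    obj_deref(svc);        /* balanced path */
}

/* Runs n_rounds queue/handle rounds and returns the leftover count.
 * double_ref_every: every k-th round takes the reference twice (0 = never).
 * early_exit_every: every k-th round the handler exits early (0 = never). */
int leftover_refs(int n_rounds, int double_ref_every, int early_exit_every)
{
    obj_t svc = { 0 };
    for (int i = 1; i <= n_rounds; i++) {
        obj_ref(&svc);                      /* taken before queuing the sweep */
        if (double_ref_every && i % double_ref_every == 0)
            obj_ref(&svc);                  /* first hypothesis: taken twice */
        sweep_handler(&svc, early_exit_every && i % early_exit_every == 0);
    }
    return svc.ref_cnt;
}
```

Under these assumptions, a run with neither fault ends at zero, and
either fault leaves a count that grows with the number of sweeps, which
is the kind of difference the debug output requested below should show.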
 
Please apply the patch below, make two runs (one without IOUs and one
with them) and send me the debug output of both.
TIA
 
Index: core/al/kernel/al_ioc_pnp.c
===================================================================
--- core/al/kernel/al_ioc_pnp.c (revision 2162)
+++ core/al/kernel/al_ioc_pnp.c (working copy)
@@ -2036,6 +2036,8 @@
    {
     /* Reference the service till the end of processing in the thread */
     ref_al_obj( &p_results->p_svc->obj );
+    cl_dbg_out ("~%d:[IBBUS] %s() : p_results %p, p_svc %p, ref_cnt %d",
+     KeGetCurrentProcessorNumber(), __FUNCTION__, p_results, p_svc, p_results->p_svc->obj.ref_cnt);
     cl_async_proc_queue( gp_async_pnp_mgr,
      &p_results->async_item );
    }
@@ -2234,6 +2236,8 @@
  if( !cl_atomic_dec( &p_results->p_svc->query_cnt ) ) {
   /* Reference the service till the end of processing in the thread */
   ref_al_obj( &p_results->p_svc->obj );
+  cl_dbg_out ("~%d:[IBBUS] %s() : p_results %p, p_svc %p, ref_cnt %d",
+   KeGetCurrentProcessorNumber(), __FUNCTION__, p_results, p_svc, p_results->p_svc->obj.ref_cnt);
   cl_async_proc_queue( gp_async_pnp_mgr, &p_results->async_item );
  }
 
@@ -2358,6 +2362,8 @@
    cl_async_proc_queue( gp_async_pnp_mgr, &gp_ioc_pnp->async_item );
   /* Release the reference taken for the query. */
   deref_al_obj( &p_results->p_svc->obj );
+  cl_dbg_out ("~%d:[IBBUS] %s() : p_results %p, p_svc %p, ref_cnt %d",
+   KeGetCurrentProcessorNumber(), __FUNCTION__, p_results, p_svc, p_results->p_svc->obj.ref_cnt);
   cl_free( p_results );
  }
 

 
 


________________________________

	From: Smith, Stan [mailto:stan.smith at intel.com] 
	Sent: Monday, May 18, 2009 11:33 PM
	To: Leonid Keller; Fab Tillier
	Cc: ofw at lists.openfabrics.org
	Subject: RE: [ofw] crash on IBBUS disabling while mad traffic
	
	
	Leo,
	  This patch, which I believe was committed as svn.4275, works fine
	if there are no IOUnits in the fabric. Once an IOU is present (in my
	case a Linux SRP target), this patch hangs HCA disable for a debug
	version of ibbus.
	 
	[AL]bus_release_resources(): Releasing BusFilter bfi-0
	[AL]:al_cleanup(): Destroying \ device.
	[AL]:al_cleanup(): Destroying AL Mgr.
	[AL]sync_destroy_obj() !ERROR!: Error waiting for references to be released - delaying.
	[AL]print_al_obj() !ERROR!: AL object 0000000082156200(AL_OBJ_TYPE_AL_MGR), parent: 0000000000000000 ref_cnt: 3
	 
	*** Assertion failed: cl_status == CL_SUCCESS
	***   Source File: f:\openib-windows-svn\latest\gen1\trunk\core\al\al_common.c, line 554
	 
	Break repeatedly, break Once, Ignore, terminate Process, or terminate Thread (boipt)? i
	i
	[AL]sync_destroy_obj() !ERROR!: Forcing object destruction.
	[AL]print_al_obj() !ERROR!: AL object 0000000082156200(AL_OBJ_TYPE_AL_MGR), parent: 0000000000000000 ref_cnt: 3
	[AL]print_al_obj() !ERROR!: AL object 0000000082175270(AL_OBJ_TYPE_IOC_PNP_MGR), parent: 0000000082156200 ref_cnt: 1
	[AL]print_al_obj() !ERROR!: AL object 00000000ff8ca2c0(AL_OBJ_TYPE_IOC_PNP_SVC), parent: 0000000082175270 ref_cnt: 2
	[AL]print_al_obj() !ERROR!: AL object 0000000082175270(AL_OBJ_TYPE_IOC_PNP_MGR), parent: 0000000082156200 ref_cnt: 1
	[AL]print_al_obj() !ERROR!: AL object 00000000ff8ca2c0(AL_OBJ_TYPE_IOC_PNP_SVC), parent: 0000000082175270 ref_cnt: 2
	[AL]:al_cleanup(): Destroying async obj mgr.
	[AL]:al_cleanup(): Destroying async pnp mgr.
	[AL]:al_cleanup(): Destroying async proc mgr.
	[AL]:al_cleanup(): Goodbye Cruel World =(
	[AL]bus_release_resources() ]
	Signaled to stop polling.
	Polling thread terminated.
	
	It seems there is a path in IBAL that does not release the reference
	on the IOC PnP service when an IOU is present in the fabric.
	Perhaps you could suggest a fix?
	If commit svn.4275 is removed, the call to al_cleanup() returns
	successfully with no errors.
	 
	thanks,
	 
	Stan.
	
	
________________________________

	From: Leonid Keller [mailto:leonid at mellanox.co.il] 
	Sent: Monday, April 27, 2009 5:38 AM
	To: Leonid Keller; Fab Tillier; Smith, Stan
	Cc: ofw at lists.openfabrics.org
	Subject: RE: [ofw] crash on IBBUS disabling while mad traffic
	
	
	Here is a possible explanation and a fix. Please review.
	 
	__ioc_query_sa takes references on the IOC PnP service before
	sending the node and path_record requests.
	But these references are released at the end of __node_rec_cb and
	__path_rec_cb, while the __process_sweep routine, which performs the
	IOU sweep, has only been scheduled to run in an async thread.
	If the test happens to unload the driver after __node_rec_cb and
	__path_rec_cb have finished but before __process_sweep has started to
	run, the IOC PnP service is released and __process_sweep crashes.
	 
	The patch takes a reference on the IOC PnP service before scheduling
	a thread for __process_sweep and releases the reference at the end of
	__process_sweep.
	(Note that __process_sweep schedules a thread for itself twice while
	moving through its FSM:
	SWEEP_IOU_INFO --> SWEEP_IOC_PROFILE --> SWEEP_SVC_ENTRIES --> SWEEP_COMPLETE)
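
The lifetime rule described above can be illustrated with a small
user-space model. This is only a sketch under stated assumptions, not
the IBAL implementation: svc_t, svc_ref, svc_deref, svc_destroy and
sweep_survives_unload are hypothetical names, the "freed" flag stands in
for real object destruction, and each loop iteration stands in for one
queued invocation of the sweep handler. The state names are the ones
listed in the message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the service object. */
typedef struct {
    int  ref_cnt;
    bool destroy_requested;
    bool freed;
} svc_t;

typedef enum {
    SWEEP_IOU_INFO, SWEEP_IOC_PROFILE, SWEEP_SVC_ENTRIES, SWEEP_COMPLETE
} sweep_state_t;

static void svc_ref(svc_t *s) { s->ref_cnt++; }

/* Drop one reference; a pending destroy completes when the last one goes. */
static void svc_deref(svc_t *s)
{
    assert(s->ref_cnt > 0);
    if (--s->ref_cnt == 0 && s->destroy_requested)
        s->freed = true;
}

/* Driver unload: the object is freed only if nobody holds a reference. */
static void svc_destroy(svc_t *s)
{
    s->destroy_requested = true;
    if (s->ref_cnt == 0)
        s->freed = true;
}

/* Returns true if no handler invocation touches freed memory.
 * hold_ref: take an extra reference before queuing the sweep (the fix).
 * unload_before_step: invocation index at which unload races in (-1 = never). */
bool sweep_survives_unload(bool hold_ref, int unload_before_step)
{
    svc_t svc = { 0, false, false };
    sweep_state_t state = SWEEP_IOU_INFO;
    int step = 0;

    svc_ref(&svc);        /* reference taken for the SA query */
    if (hold_ref)
        svc_ref(&svc);    /* extra reference owned by the queued sweep */
    svc_deref(&svc);      /* query callbacks finish and drop their reference */

    while (state != SWEEP_COMPLETE) {
        if (step == unload_before_step)
            svc_destroy(&svc);      /* unload between two invocations */
        if (svc.freed)
            return false;           /* handler would read freed memory */
        /* A destroying object ends the sweep early; otherwise advance. */
        state = svc.destroy_requested ? SWEEP_COMPLETE
                                      : (sweep_state_t)(state + 1);
        step++;
    }
    if (hold_ref)
        svc_deref(&svc);  /* released only at the end of the sweep */
    return true;
}
```

In this model the sweep survives an unload at any step as long as the
extra reference is held until SWEEP_COMPLETE, and fails as soon as an
unload lands between two invocations when it is not.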
	 
	Index: al/kernel/al_ioc_pnp.c
	===================================================================
	--- al/kernel/al_ioc_pnp.c (revision 3609)
	+++ al/kernel/al_ioc_pnp.c (working copy)
	@@ -2231,8 +2231,11 @@
	   * If this is the last MAD, finish processing the IOU queries
	   * in the PnP thread.
	   */
	- if( !cl_atomic_dec( &p_results->p_svc->query_cnt ) )
	+ if( !cl_atomic_dec( &p_results->p_svc->query_cnt ) ) {
	+  /* Reference the service till the end of processing in the thread */
	+  ref_al_obj( &p_results->p_svc->obj );
	   cl_async_proc_queue( gp_async_pnp_mgr, &p_results->async_item );
	+ }
	 
	  AL_EXIT( AL_DBG_PNP );
	 }
	@@ -2354,6 +2357,8 @@
	   if( !cl_atomic_dec( &gp_ioc_pnp->query_cnt ) )
	    cl_async_proc_queue( gp_async_pnp_mgr, &gp_ioc_pnp->async_item );
	+  /* Release the reference taken for the query. */
	+  deref_al_obj( &p_results->p_svc->obj );
	   cl_free( p_results );
	  }
	 
	  AL_EXIT( AL_DBG_PNP );
	
	 


________________________________

		From: Leonid Keller 
		Sent: Sunday, April 26, 2009 1:05 AM
		To: 'Fab Tillier'; 'Smith, Stan'
		Cc: ofw at lists.openfabrics.org
		Subject: [ofw] crash on IBBUS disabling while mad traffic
		
		
		I got a crash while running the WHQL Disable Enable test
		while opensm was running on another node.
		I was running a December version of the driver, and I'm
		not sure whether the test will pass with the current one
		(I'll try).
		 
		The test, which disables and re-enables all devices,
		passes without opensm.
		With opensm running, IBBUS sends SA requests to it.
		In this case __process_sweep() fails, because the per-port
		IOC PnP agent seems to have been released already.
		That is strange, because __ioc_query_sa takes a
		reference on the PnP agent before sending the request.
		   __ioc_query_sa
		    __node_rec_cb
		     __process_query
		      __process_sweep
		
		Any ideas?
		 
		 
		3: kd> !analyze -v
		ERROR: FindPlugIns 8007007b
	
		*******************************************************************************
		*                                                                             *
		*                        Bugcheck Analysis                                    *
		*                                                                             *
		*******************************************************************************
		 
		DRIVER_PAGE_FAULT_IN_FREED_SPECIAL_POOL (d5)
		Memory was referenced after it was freed.
		This cannot be protected by try-except.
		When possible, the guilty driver's name (Unicode string) is printed on
		the bugcheck screen and saved in KiBugCheckDriver.
		Arguments:
		Arg1: fffff98005b72f84, memory referenced
		Arg2: 0000000000000000, value 0 = read operation, 1 = write operation
		Arg3: fffffa600400b1d0, if non-zero, the address which referenced memory.
		Arg4: 0000000000000000, (reserved)
		 
		Debugging Details:
		------------------
		 
		Matched: ibbus!proxy_ioctl+0x41 (fffffa60`04031d8d) 
		Matched: ibbus!proxy_ioctl+0xa5 (fffffa60`04031df1) 
		 
		READ_ADDRESS:  fffff98005b72f84 Special pool
		 
		FAULTING_IP: 
		ibbus!__process_sweep+44 [s:\builds\3609\branches\mlnx_winof_2-0\core\al\kernel\al_ioc_pnp.c @ 2315]
		fffffa60`0400b1d0 83b8d400000003  cmp     dword ptr [rax+0D4h],3
		 
		MM_INTERNAL_CODE:  0
		 
		IMAGE_NAME:  ibbus.sys
		 
		DEBUG_FLR_IMAGE_TIMESTAMP:  49401b3e
		 
		MODULE_NAME: ibbus
		 
		FAULTING_MODULE: fffffa6004002000 ibbus
		 
		DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT
		 
		BUGCHECK_STR:  0xD5
		 
		PROCESS_NAME:  System
		 
		CURRENT_IRQL:  f
		 
		TRAP_FRAME:  fffffa6003d50b00 -- (.trap 0xfffffa6003d50b00)
		NOTE: The trap frame does not contain all registers.
		Some register values may be zeroed or incorrect.
		rax=fffff98005b72eb0 rbx=0000000000000000 rcx=fffffa6004057780
		rdx=fffffa6004005e97 rsi=fffffa600199ccc0 rdi=fffff80001cc0304
		rip=fffffa600400b1d0 rsp=fffffa6003d50c90 rbp=0000000000000080
		 r8=0000000000000005  r9=fffffa6004005e97 r10=0000000000000001
		r11=fffffa6003d50c50 r12=0000000000000000 r13=0000000000000000
		r14=0000000000000000 r15=0000000000000000
		iopl=0         nv up ei pl zr na po nc
		ibbus!__process_sweep+0x44:
		fffffa60`0400b1d0 83b8d400000003  cmp     dword ptr [rax+0D4h],3 ds:fffff980`05b72f84=????????
		Resetting default scope
		 
		LAST_CONTROL_TRANSFER:  from fffff80001969c42 to fffff800018b0b30
		 
		STACK_TEXT:  
		fffffa60`03d502f8 fffff800`01969c42 : fffffa80`0e0eb290 fffff800`0194893d fffff800`01a55140 00000000`00001000 : nt!RtlpBreakWithStatusInstruction
		fffffa60`03d50300 fffff800`0196adb7 : fffff800`00000004 fffff800`01a55140 ffffffff`fffff000 00000000`00000050 : nt!KiBugCheckDebugBreak+0x12
		fffffa60`03d50360 fffff800`018b6754 : fffffa80`0dd77480 fffff800`01cc2bb9 00000000`00000000 fffff800`0194c13f : nt!KeBugCheck2+0xaa7
		fffffa60`03d509d0 fffff800`018c5671 : 00000000`00000050 fffff980`05b72f84 00000000`00000000 fffffa60`03d50b00 : nt!KeBugCheckEx+0x104
		fffffa60`03d50a10 fffff800`018b51d9 : 00000000`00000000 fffff980`0427cf78 fffffa80`0e0ecf00 fffff980`1c27ef40 : nt!MmAccessFault+0x1371
		fffffa60`03d50b00 fffffa60`0400b1d0 : fffff980`1c27ef40 fffff980`04318e00 fffffa60`04005eba fffff980`04318e78 : nt!KiPageFault+0x119
		fffffa60`03d50c90 fffffa60`04005e9d : fffff980`04318e98 fffff980`043bccb0 fffff980`1b88afd0 fffff980`04318e78 : ibbus!__process_sweep+0x44 [s:\builds\3609\branches\mlnx_winof_2-0\core\al\kernel\al_ioc_pnp.c @ 2315]
		fffffa60`03d50cc0 fffffa60`040070d9 : fffff980`04318d60 fffff980`0434afd0 00000000`00000000 fffffa60`0400743c : ibbus!__cl_async_proc_worker+0x61 [s:\builds\3609\branches\mlnx_winof_2-0\core\complib\cl_async_proc.c @ 153]
		fffffa60`03d50cf0 fffffa60`04007464 : fffff980`0434afd0 00000000`00000080 fffff980`0434afd0 8b8b8b8b`8b8b8b8b : ibbus!__cl_thread_pool_routine+0x41 [s:\builds\3609\branches\mlnx_winof_2-0\core\complib\cl_threadpool.c @ 66]
		fffffa60`03d50d20 fffff800`01adafd3 : 8b8b8b8b`8b8b8b8b 8b8b8b8b`8b8b8b8b 8b8b8b8b`8b8b8b8b 8b8b8b8b`8b8b8b01 : ibbus!__thread_callback+0x28 [s:\builds\3609\branches\mlnx_winof_2-0\core\complib\kernel\cl_thread.c @ 49]
		fffffa60`03d50d50 fffff800`018f0816 : fffffa60`01999180 fffffa80`0e0eb290 fffffa60`019a2d40 00000000`00000001 : nt!PspSystemThreadStartup+0x57
		fffffa60`03d50d80 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x16
		 
		

		STACK_COMMAND:  kb
		 
		FOLLOWUP_IP: 
		ibbus!__process_sweep+44 [s:\builds\3609\branches\mlnx_winof_2-0\core\al\kernel\al_ioc_pnp.c @ 2315]
		fffffa60`0400b1d0 83b8d400000003  cmp     dword ptr [rax+0D4h],3
		 
		FAULTING_SOURCE_CODE:  
		  2311: 
		  2312:  p_results = PARENT_STRUCT( p_async_item, ioc_sweep_results_t, async_item );
		  2313:  CL_ASSERT( !p_results->p_svc->query_cnt );
		  2314: 
		> 2315:  if( p_results->p_svc->obj.state == CL_DESTROYING )
		  2316:  {
		  2317:   __put_iou_map( gp_ioc_pnp, &p_results->iou_map );
		  2318:   goto err;
		  2318:   goto err;
		  2319:  }
		  2320: 
		 

		SYMBOL_STACK_INDEX:  6
		 
		SYMBOL_NAME:  ibbus!__process_sweep+44
		 
		FOLLOWUP_NAME:  MachineOwner
		 
		FAILURE_BUCKET_ID:  X64_0xD5_VRF_ibbus!__process_sweep+44
		 
		BUCKET_ID:  X64_0xD5_VRF_ibbus!__process_sweep+44
		 
		Followup: MachineOwner
		---------
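
The faulting address in the dump above can be cross-checked by hand: the
instruction reads a dword at rax+0D4h, and rax in the trap frame is
fffff98005b72eb0. The sketch below just repeats that addition; the
function name faulting_address is made up for illustration. That rax
holds p_results->p_svc, that 0D4h is the offset of obj.state, and that 3
is the value of CL_DESTROYING are inferences from the source line at
2315, not something the dump states.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the trap frame and the faulting instruction. */
#define TRAP_RAX      UINT64_C(0xfffff98005b72eb0)
#define CMP_DISP      UINT64_C(0xD4)   /* displacement in [rax+0D4h] */
#define BUGCHECK_ARG1 UINT64_C(0xfffff98005b72f84)

/* Effective address read by "cmp dword ptr [base+disp],imm". */
uint64_t faulting_address(uint64_t base, uint64_t disp)
{
    return base + disp;
}
```

The sum matches Arg1 of the bugcheck and the ds: address shown next to
the instruction, so the read that faulted is the state check on the
already-freed service object.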
		 
		 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: sweep1.patch
Type: application/octet-stream
Size: 1460 bytes
Desc: sweep1.patch
URL: <http://lists.openfabrics.org/pipermail/ofw/attachments/20090519/805381f3/attachment.obj>

