[ofw] crash on IBBUS disabling while mad traffic

Leonid Keller leonid at mellanox.co.il
Mon Apr 27 05:38:18 PDT 2009


Here is a possible explanation and a fix. Please review.
 
__ioc_query_sa takes references on the IOC PnP service before sending the
node and path_record requests.
But these references are released at the end of __node_rec_cb and
__path_rec_cb, while the __process_sweep routine, which performs the IOU
sweep, is at that point only scheduled to run in an async thread.
If the test happens to unload the driver after __node_rec_cb and
__path_rec_cb have finished but before __process_sweep starts to run, the
IOC PnP service is released and __process_sweep crashes.
 
The patch takes a reference on the IOC PnP service before scheduling a
thread for __process_sweep and releases the reference at the end of
__process_sweep.
(Note that __process_sweep schedules a thread for itself twice
while moving through its FSM:
SWEEP_IOU_INFO --> SWEEP_IOC_PROFILE --> SWEEP_SVC_ENTRIES -->
SWEEP_COMPLETE)
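
To illustrate the idea, here is a minimal user-mode sketch. It is not the
ibal code: svc_t, svc_ref, svc_deref and run_scenario are hypothetical
names, and a "destroyed" flag stands in for actually freeing the object.
The model starts with two references (one held by the owner, one by the
outstanding query) and reports whether the service is still alive when the
deferred sweep finally runs.

```c
#include <assert.h>

/* Hypothetical stand-in for the ref-counted IOC PnP service object. */
typedef struct svc {
    int ref_cnt;    /* outstanding references */
    int destroyed;  /* set when the last reference is dropped */
} svc_t;

void svc_ref(svc_t *s)
{
    s->ref_cnt++;
}

void svc_deref(svc_t *s)
{
    assert(s->ref_cnt > 0);
    if (--s->ref_cnt == 0)
        s->destroyed = 1;   /* real code would free the object here */
}

/*
 * Replays the sequence described above in a single thread.
 * Returns 1 if the deferred sweep saw a live service, 0 if the
 * service had already been destroyed when the sweep ran.
 */
int run_scenario(int take_ref_before_queue)
{
    svc_t svc = { 2, 0 };   /* owner's reference + query's reference */
    int alive;

    /* Query callback: the last MAD arrived, the sweep is queued. */
    if (take_ref_before_queue)
        svc_ref(&svc);      /* the patch: hold the service for the sweep */
    svc_deref(&svc);        /* callback drops the query's reference */

    /* Driver unload happens before the async thread gets to run. */
    svc_deref(&svc);        /* owner drops its reference */

    /* Deferred sweep runs now. */
    alive = !svc.destroyed;
    if (take_ref_before_queue)
        svc_deref(&svc);    /* released at the end of the sweep */

    return alive;
}
```

In this model run_scenario(0) corresponds to the current code and returns 0
(the sweep runs on a destroyed service), while run_scenario(1) corresponds
to the patched ordering and returns 1.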
 
Index: al/kernel/al_ioc_pnp.c
===================================================================
--- al/kernel/al_ioc_pnp.c (revision 3609)
+++ al/kernel/al_ioc_pnp.c (working copy)
@@ -2231,8 +2231,11 @@
   * If this is the last MAD, finish processing the IOU queries
   * in the PnP thread.
   */
- if( !cl_atomic_dec( &p_results->p_svc->query_cnt ) )
+ if( !cl_atomic_dec( &p_results->p_svc->query_cnt ) ) {
+  /* Reference the service till the end of processing in the thread */
+  ref_al_obj( &p_results->p_svc->obj );
   cl_async_proc_queue( gp_async_pnp_mgr, &p_results->async_item );
+ }
 
  AL_EXIT( AL_DBG_PNP );
 }
@@ -2354,6 +2357,8 @@
   if( !cl_atomic_dec( &gp_ioc_pnp->query_cnt ) )
    cl_async_proc_queue( gp_async_pnp_mgr, &gp_ioc_pnp->async_item );
+  /* Release the reference taken for the query (before p_results is freed). */
+  deref_al_obj( &p_results->p_svc->obj );
   cl_free( p_results );
  }
 
  AL_EXIT( AL_DBG_PNP );

 


________________________________

	From: Leonid Keller 
	Sent: Sunday, April 26, 2009 1:05 AM
	To: 'Fab Tillier'; 'Smith, Stan'
	Cc: ofw at lists.openfabrics.org
	Subject: [ofw] crash on IBBUS disabling while mad traffic
	
	
	I got a crash while running the WHQL Disable/Enable test while
	opensm was running on another node.
	I was running a December version of the driver, and I'm not sure
	whether the same happens with the current one (I'll try).
	 
	The test, which disables and re-enables all devices, passes
	without opensm.
	With opensm running, IBBUS sends SA requests to opensm.
	In this case __process_sweep() fails, because the per-port IOC PnP
	agent seems to have been released already.
	This is strange, because __ioc_query_sa takes a reference on the
	PnP agent before sending the request.
	   __ioc_query_sa
	    __node_rec_cb
	     __process_query
	      __process_sweep
	
	Any ideas?
	 
	 
	3: kd> !analyze -v
	ERROR: FindPlugIns 8007007b
	
	*******************************************************************************
	*                                                                             *
	*                        Bugcheck Analysis                                    *
	*                                                                             *
	*******************************************************************************
	 
	DRIVER_PAGE_FAULT_IN_FREED_SPECIAL_POOL (d5)
	Memory was referenced after it was freed.
	This cannot be protected by try-except.
	When possible, the guilty driver's name (Unicode string) is printed on
	the bugcheck screen and saved in KiBugCheckDriver.
	Arguments:
	Arg1: fffff98005b72f84, memory referenced
	Arg2: 0000000000000000, value 0 = read operation, 1 = write operation
	Arg3: fffffa600400b1d0, if non-zero, the address which referenced memory.
	Arg4: 0000000000000000, (reserved)
	 
	Debugging Details:
	------------------
	 
	Matched: ibbus!proxy_ioctl+0x41 (fffffa60`04031d8d) 
	Matched: ibbus!proxy_ioctl+0xa5 (fffffa60`04031df1) 
	 
	READ_ADDRESS:  fffff98005b72f84 Special pool
	 
	FAULTING_IP: 
	ibbus!__process_sweep+44 [s:\builds\3609\branches\mlnx_winof_2-0\core\al\kernel\al_ioc_pnp.c @ 2315]
	fffffa60`0400b1d0 83b8d400000003  cmp     dword ptr [rax+0D4h],3
	 
	MM_INTERNAL_CODE:  0
	 
	IMAGE_NAME:  ibbus.sys
	 
	DEBUG_FLR_IMAGE_TIMESTAMP:  49401b3e
	 
	MODULE_NAME: ibbus
	 
	FAULTING_MODULE: fffffa6004002000 ibbus
	 
	DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT
	 
	BUGCHECK_STR:  0xD5
	 
	PROCESS_NAME:  System
	 
	CURRENT_IRQL:  f
	 
	TRAP_FRAME:  fffffa6003d50b00 -- (.trap 0xfffffa6003d50b00)
	NOTE: The trap frame does not contain all registers.
	Some register values may be zeroed or incorrect.
	rax=fffff98005b72eb0 rbx=0000000000000000 rcx=fffffa6004057780
	rdx=fffffa6004005e97 rsi=fffffa600199ccc0 rdi=fffff80001cc0304
	rip=fffffa600400b1d0 rsp=fffffa6003d50c90 rbp=0000000000000080
	 r8=0000000000000005  r9=fffffa6004005e97 r10=0000000000000001
	r11=fffffa6003d50c50 r12=0000000000000000 r13=0000000000000000
	r14=0000000000000000 r15=0000000000000000
	iopl=0         nv up ei pl zr na po nc
	ibbus!__process_sweep+0x44:
	fffffa60`0400b1d0 83b8d400000003  cmp     dword ptr [rax+0D4h],3 ds:fffff980`05b72f84=????????
	Resetting default scope
	 
	LAST_CONTROL_TRANSFER:  from fffff80001969c42 to fffff800018b0b30
	 
	STACK_TEXT:  
	fffffa60`03d502f8 fffff800`01969c42 : fffffa80`0e0eb290 fffff800`0194893d fffff800`01a55140 00000000`00001000 : nt!RtlpBreakWithStatusInstruction
	fffffa60`03d50300 fffff800`0196adb7 : fffff800`00000004 fffff800`01a55140 ffffffff`fffff000 00000000`00000050 : nt!KiBugCheckDebugBreak+0x12
	fffffa60`03d50360 fffff800`018b6754 : fffffa80`0dd77480 fffff800`01cc2bb9 00000000`00000000 fffff800`0194c13f : nt!KeBugCheck2+0xaa7
	fffffa60`03d509d0 fffff800`018c5671 : 00000000`00000050 fffff980`05b72f84 00000000`00000000 fffffa60`03d50b00 : nt!KeBugCheckEx+0x104
	fffffa60`03d50a10 fffff800`018b51d9 : 00000000`00000000 fffff980`0427cf78 fffffa80`0e0ecf00 fffff980`1c27ef40 : nt!MmAccessFault+0x1371
	fffffa60`03d50b00 fffffa60`0400b1d0 : fffff980`1c27ef40 fffff980`04318e00 fffffa60`04005eba fffff980`04318e78 : nt!KiPageFault+0x119
	fffffa60`03d50c90 fffffa60`04005e9d : fffff980`04318e98 fffff980`043bccb0 fffff980`1b88afd0 fffff980`04318e78 : ibbus!__process_sweep+0x44 [s:\builds\3609\branches\mlnx_winof_2-0\core\al\kernel\al_ioc_pnp.c @ 2315]
	fffffa60`03d50cc0 fffffa60`040070d9 : fffff980`04318d60 fffff980`0434afd0 00000000`00000000 fffffa60`0400743c : ibbus!__cl_async_proc_worker+0x61 [s:\builds\3609\branches\mlnx_winof_2-0\core\complib\cl_async_proc.c @ 153]
	fffffa60`03d50cf0 fffffa60`04007464 : fffff980`0434afd0 00000000`00000080 fffff980`0434afd0 8b8b8b8b`8b8b8b8b : ibbus!__cl_thread_pool_routine+0x41 [s:\builds\3609\branches\mlnx_winof_2-0\core\complib\cl_threadpool.c @ 66]
	fffffa60`03d50d20 fffff800`01adafd3 : 8b8b8b8b`8b8b8b8b 8b8b8b8b`8b8b8b8b 8b8b8b8b`8b8b8b8b 8b8b8b8b`8b8b8b01 : ibbus!__thread_callback+0x28 [s:\builds\3609\branches\mlnx_winof_2-0\core\complib\kernel\cl_thread.c @ 49]
	fffffa60`03d50d50 fffff800`018f0816 : fffffa60`01999180 fffffa80`0e0eb290 fffffa60`019a2d40 00000000`00000001 : nt!PspSystemThreadStartup+0x57
	fffffa60`03d50d80 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x16
	 
	

	STACK_COMMAND:  kb
	 
	FOLLOWUP_IP: 
	ibbus!__process_sweep+44 [s:\builds\3609\branches\mlnx_winof_2-0\core\al\kernel\al_ioc_pnp.c @ 2315]
	fffffa60`0400b1d0 83b8d400000003  cmp     dword ptr [rax+0D4h],3
	 
	FAULTING_SOURCE_CODE:  
	  2311: 
	  2312:  p_results = PARENT_STRUCT( p_async_item, ioc_sweep_results_t, async_item );
	  2313:  CL_ASSERT( !p_results->p_svc->query_cnt );
	  2314: 
	> 2315:  if( p_results->p_svc->obj.state == CL_DESTROYING )
	  2316:  {
	  2317:   __put_iou_map( gp_ioc_pnp, &p_results->iou_map );
	  2318:   goto err;
	  2319:  }
	  2320: 
	 

	SYMBOL_STACK_INDEX:  6
	 
	SYMBOL_NAME:  ibbus!__process_sweep+44
	 
	FOLLOWUP_NAME:  MachineOwner
	 
	FAILURE_BUCKET_ID:  X64_0xD5_VRF_ibbus!__process_sweep+44
	 
	BUCKET_ID:  X64_0xD5_VRF_ibbus!__process_sweep+44
	 
	Followup: MachineOwner
	---------
	 
	 
