[Openib-windows] srp blue screen when CM fail to connect

Yossi Leybovich sleybo at mellanox.co.il
Sun Sep 10 04:19:22 PDT 2006


Fab
 
We still get the blue screen (even on a new installation) on object
destruction.
The problem is with ib_close_ca.
The CA object uses sync destruction (AL_OBJ_SUBTYPE_UM_EXPORT), but
its destruction still uses the ib_sync_destroy flag.
 
proxy_close_ca function:

 /*
  * Note that we hold a reference on the CA, so we need to
  * call close_ca, not ib_close_ca.  We also don't release the reference
  * since close_ca will do so (by destroying the object).
  */
 h_ca->obj.pfn_destroy( &h_ca->obj, ib_sync_destroy );
 p_ioctl->out.status = IB_SUCCESS;

The code does not protect sync objects from using ib_sync_destroy as a
function, and we end up calling 0xffffffff.
We can fix the call in proxy_close_ca, but there are more places that use
ib_sync_destroy.
I think the way to solve this is to also check for ib_sync_destroy in the
sync_destroy_obj function.
 
This patch fixes the problem:
 
Index: W:/work/clean/core/al/al_common.c
===================================================================
--- W:/work/clean/core/al/al_common.c (revision 1666)
+++ W:/work/clean/core/al/al_common.c (revision 1667)
@@ -467,7 +467,7 @@
  AL_ENTER( AL_DBG_AL_OBJ );
 
  if( pfn_destroy_cb == ib_sync_destroy )
-  sync_destroy_obj( p_obj, __sync_destroy_cb );
+  sync_destroy_obj( p_obj, pfn_destroy_cb );
  else if( destroy_obj( p_obj, pfn_destroy_cb ) )
   deref_al_obj( p_obj ); /* Only destroy the object once. */
 
@@ -482,10 +482,12 @@
  IN  const ib_pfn_destroy_cb_t   pfn_destroy_cb )
 {
  cl_status_t  cl_status;
+ ib_pfn_destroy_cb_t  destroy_cb = (pfn_destroy_cb == ib_sync_destroy) ? __sync_destroy_cb :pfn_destroy_cb;
 
  AL_ENTER( AL_DBG_AL_OBJ );
 
- if( !destroy_obj( p_obj, pfn_destroy_cb ) )
+
+ if( !destroy_obj( p_obj, destroy_cb ) )
  {
   /* Object is already being destroyed... */
   AL_EXIT( AL_DBG_AL_OBJ );
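
The idea behind the patch can be sketched in isolation. This is only a minimal illustration, not the IBAL code: destroy_cb_t, SYNC_DESTROY, internal_sync_cb, user_cb, normalize_destroy_cb and destroy_object are hypothetical names standing in for ib_pfn_destroy_cb_t, ib_sync_destroy, __sync_destroy_cb and the destroy path. It assumes the sentinel is a -1 pointer, as described later in this thread, and it maps the sentinel to a real function before anything can call through it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for ib_pfn_destroy_cb_t; not the real definition. */
typedef void (*destroy_cb_t)(void *context);

/* Sentinel meaning "destroy synchronously". It is a marker value only and
 * must never be invoked (assumed here to be a -1 pointer). */
#define SYNC_DESTROY ((destroy_cb_t)(intptr_t)-1)

static int internal_sync_calls = 0;
static int user_calls = 0;

/* Internal callback that is safe to invoke in place of the sentinel. */
static void internal_sync_cb(void *context)
{
    (void)context;
    internal_sync_calls++;
}

/* Example of an ordinary caller-supplied callback. */
static void user_cb(void *context)
{
    (void)context;
    user_calls++;
}

/* Map the sentinel to a real function; pass real callbacks through. */
static destroy_cb_t normalize_destroy_cb(destroy_cb_t cb)
{
    return (cb == SYNC_DESTROY) ? internal_sync_cb : cb;
}

/* Toy destroy routine: callers may pass the sentinel or a real callback. */
static void destroy_object(void *context, destroy_cb_t cb)
{
    destroy_cb_t safe_cb = normalize_destroy_cb(cb);

    /* The sentinel must never reach the call below. */
    assert(safe_cb != SYNC_DESTROY);

    if (safe_cb != NULL)
        safe_cb(context);
}
```

With this shape, destroy_object(NULL, SYNC_DESTROY) runs internal_sync_cb instead of jumping to the sentinel address, regardless of which caller passed the sentinel in.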

 
This is the blue screen call stack:
 

PAGE_FAULT_IN_NONPAGED_AREA (50)
Invalid system memory was referenced.  This cannot be protected by try-except,
it must be protected by a Probe.  Typically the address is just plain bad or it
is pointing at freed memory.
Arguments:
Arg1: ffffffff, memory referenced.
Arg2: 00000000, value 0 = read operation, 1 = write operation.
Arg3: ffffffff, If non-zero, the instruction address which referenced the bad memory
 address.
Arg4: 00000000, (reserved)
 
Debugging Details:
------------------
 

READ_ADDRESS:  ffffffff 
 
FAULTING_IP: 
+ffffffffffffffff
ffffffff ??               ???
 
MM_INTERNAL_CODE:  0
 
DEFAULT_BUCKET_ID:  DRIVER_FAULT
 
BUGCHECK_STR:  0x50
 
CURRENT_IRQL:  1
 
LAST_CONTROL_TRANSFER:  from 8087a46f to 80833f96
 
FAILED_INSTRUCTION_ADDRESS: 
+ffffffffffffffff
ffffffff ??               ???
 
STACK_TEXT:  
b959e698 8087a46f 00000003 00000000 ffffffff nt!RtlpBreakWithStatusInstruction
b959e6e4 8087b236 00000003 808b4120 88d848c8 nt!KiBugCheckDebugBreak+0x19
b959ea7c 8087b6be 00000050 ffffffff 00000000 nt!KeBugCheck2+0x5b2
b959ea9c 808689ee 00000050 ffffffff 00000000 nt!KeBugCheckEx+0x1b
b959eaec 80837d0a 00000000 ffffffff 00000000 nt!MmAccessFault+0x813
b959eaec ffffffff 00000000 ffffffff 00000000 nt!KiTrap0E+0xdc
WARNING: Frame IP not in any known module. Following frames may be wrong.
b959eb74 b940b35a 00089318 88d86b2c 00000200 0xffffffff
b959eb8c b940b024 88d86b2c 89989d20 899a6cd8 ibbus!async_destroy_cb+0xda [s:\builds\1660\trunk\core\al\al_common.c @ 675]
b959eba0 b9410b9e 88d86b18 ffffffff ffffffff ibbus!sync_destroy_obj+0xd6 [s:\builds\1660\trunk\core\al\al_common.c @ 548]
b959ebbc b94147de 8895b2c0 b959ebf4 8895b2c0 ibbus!proxy_close_ca+0xca [s:\builds\1660\trunk\core\al\kernel\al_proxy_verbs.c @ 638]
b959ebd8 b93ee640 8895b2c0 b959ebf4 89cf32d8 ibbus!verbs_ioctl+0x24c [s:\builds\1660\trunk\core\al\kernel\al_proxy_verbs.c @ 3451]
b959ebf8 b942411e 8895b2c0 80a78be4 89cf3220 ibbus!al_dev_ioctl+0xb2 [s:\builds\1660\trunk\core\al\kernel\al_dev.c @ 455]
b959ec10 809d457d 89cf3304 89cf32ec 8895b2c0 ibbus!bus_drv_ioctl+0x8e [s:\builds\1660\trunk\core\bus\kernel\bus_driver.c @ 401]
b959ec40 80859657 8092d3b9 b959ec60 8092d3b9 nt!IovCallDriver+0x112
b959ec4c 8092d3b9 8895b354 8987f280 8895b2c0 nt!IofCallDriver+0x13
b959ec60 8092e81b 89cf3220 8895b2c0 8987f280 nt!IopSynchronousServiceTail+0x10b
b959ed00 80940844 00000788 00000000 00000000 nt!IopXxxControlFile+0x5db
b959ed34 80834d3f 00000788 00000000 00000000 nt!NtDeviceIoControlFile+0x2a
b959ed34 7c82ed54 00000788 00000000 00000000 nt!KiFastCallEntry+0xfc
0006f838 7c8213e4 77e416f1 00000788 00000000 ntdll!KiFastSystemCallRet
0006f83c 77e416f1 00000788 00000000 00000000 ntdll!NtDeviceIoControlFile+0xc
0006f8a0 00265210 00000788 003b0040 0006f9e4 kernel32!DeviceIoControl+0x137
0006f8d0 002676bd 003b0040 0006f9e4 00000008 ibal!do_al_dev_ioctl+0x57 [s:\builds\1660\trunk\core\al\user\al_dll.c @ 192]
0006f9ec 00268203 00089318 002518d8 00089318 ibal!ual_close_ca+0xd3 [s:\builds\1660\trunk\core\al\user\ual_ca.c @ 236]
0006fa04 00259ab0 00089318 0008932c 00089318 ibal!cleanup_ci_ca+0x4e [s:\builds\1660\trunk\core\al\user\ual_ci_ca.c @ 328]
0006fa1c 00259780 0008932c 00084f48 00084fac ibal!async_destroy_cb+0x98 [s:\builds\1660\trunk\core\al\al_common.c @ 661]
0006fa30 002594a2 00089318 00000000 00000000 ibal!sync_destroy_obj+0xd7 [s:\builds\1660\trunk\core\al\al_common.c @ 548]
0006fa50 002596ea 00084f48 00000000 00000000 ibal!destroy_obj+0x150 [s:\builds\1660\trunk\core\al\al_common.c @ 615]
0006fa6c 0025b4a5 00084f48 00000000 00000000 ibal!sync_destroy_obj+0x41 [s:\builds\1660\trunk\core\al\al_common.c @ 488]
0006fa84 0026d51a 77bd27c2 0006fb1c 004026bb ibal!al_cleanup+0xa3 [s:\builds\1660\trunk\core\al\al_init.c @ 146]
0006fa90 004026bb 0008a9d8 00000000 00000000 ibal!ib_close_al+0x39 [s:\builds\1660\trunk\core\al\user\ual_mgr.c @ 1106]
0006fb1c 0100123b 0006fb38 00000400 ffffffff mtcr!mdevices+0xa1 [s:\builds\1660\trunk\tools\mft\user\mtcr\mtcr.c @ 473]
0006ff3c 010012c5 0006ffc0 0100157b 00000002 mst!list_devices+0x2b [s:\builds\1660\trunk\tools\mft\user\mst\mst.c @ 18]
0006ff44 0100157b 00000002 002a24c0 002a2ac0 mst!main+0x28 [s:\builds\1660\trunk\tools\mft\user\mst\mst.c @ 41]
0006ffc0 77e523cd 00000000 00000000 7ffde000 mst!mainCRTStartup+0x12f [d:\dnsrv\base\crts\crtw32\dllstuff\crtexe.c @ 501]
0006fff0 00000000 0100144c 00000000 78746341 kernel32!BaseProcessStart+0x23
 

STACK_COMMAND:  kb
 
FOLLOWUP_IP: 
ibbus!async_destroy_cb+da [s:\builds\1660\trunk\core\al\al_common.c @ 675]
b940b35a f605243042b902 test byte ptr [ibbus!WPP_GLOBAL_Control+0x14 (b9423024)],0x2
 
FAULTING_SOURCE_CODE:  
   671:   p_obj->user_destroy_cb( (void*)p_obj->context );
   672:  }
   673: 
   674:  /* Free the resources associated with the object. */
>  675:  AL_PRINT( TRACE_LEVEL_INFORMATION, AL_DBG_AL_OBJ, ("freeing object\n" ) );
   676:  p_obj->pfn_free( p_obj );
   677: 
   678:  /* Dereference the parent after freeing the child. */
   679:  if( p_parent_obj )
   680:   deref_al_obj( p_parent_obj );
 

SYMBOL_STACK_INDEX:  7
 
FOLLOWUP_NAME:  MachineOwner
 
SYMBOL_NAME:  ibbus!async_destroy_cb+da
 
MODULE_NAME:  ibbus
 
IMAGE_NAME:  ibbus.sys
 
DEBUG_FLR_IMAGE_TIMESTAMP:  44ffe561
 
FAILURE_BUCKET_ID:  0x50_VRF_CODE_AV_BAD_IP_ibbus!async_destroy_cb+da
 
BUCKET_ID:  0x50_VRF_CODE_AV_BAD_IP_ibbus!async_destroy_cb+da
 
Followup: MachineOwner



________________________________

	From: Tillier, Fabian [mailto:ftillier at silverstorm.com] 
	Sent: Tuesday, September 05, 2006 11:12 PM
	To: Yossi Leybovich
	Subject: RE: [Openib-windows] srp blue screen when CM fail to connect
	
	

	Hi Yossi,

	
	I hadn't tried to compile it for x86, no.  I have a fix for this
	in my sandbox, though.  Can you verify that the ib_sync_destroy changes
	do work for you?  The only way you would crash during destruction is if
	the -1 pointer for ib_sync_destroy was used as an actual function
	pointer.  The patch I sent addressed the issue, so I suspect the IBAL
	binary wasn't updated.  Let me know and I'll check in a patch that
	handles both x64 and x86.

	 

	- Fab

	 

	
________________________________


	From: Yossi Leybovich [mailto:sleybo at mellanox.co.il] 
	Sent: Monday, September 04, 2006 1:11 AM
	To: Tillier, Fabian
	Subject: RE: [Openib-windows] srp blue screen when CM fail to connect

	 

	  

	> -----Original Message----- 
	> From: ftillier.sst at gmail.com [mailto:ftillier.sst at gmail.com] 
	> On Behalf Of Fabian Tillier 
	> Sent: Saturday, September 02, 2006 2:04 AM 
	> To: Yossi Leybovich 
	> Cc: openib-windows at openib.org 
	> Subject: Re: [Openib-windows] srp blue screen when CM fail to connect 
	> 
	> Hi Yossi, 
	> 
	> On 8/31/06, Yossi Leybovich <sleybo at mellanox.co.il> wrote: 
	> > 
	> > Fab 
	> > I got blue screen while trying to bring our SRP target up. 
	> 
	> I have a fix, but I don't quite understand why it makes the 
	> problem go away.  I didn't find any double free issues in the 
	> code, but did take the opportunity to clean up the code a little. 
	> 
	> Since I don't quite understand why it works now, I didn't 
	> check the changes in.  Instead the patch is attached - please 
	> give it a shot and let me know if you see the previous bug. 
	> 

	Did you try to compile this for 32-bit? 
	I got 2 errors: 
	1. 
	2>errors in directory w:\work\latest\core\al\kernel 
	2>inc\iba\ib_al.h(436) : error C2220: warning treated as error - no object file generated 
	2>inc\iba\ib_al.h(436) : error C4306: 'type cast' : conversion from 'LONG_PTR' to 'ib_pfn_destroy_cb_t' of greater size 
	2. 
	100>errors in directory w:\work\latest\core\al\user 
	100>al_exports.def : error LNK2001: unresolved external symbol ib_sync_destroy 
	100>wbin\user\objchk_wnet_x86\i386\ibald.lib : error LNK1120: 1 unresolved externals 

	 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ib_destroy.patch
Type: application/octet-stream
Size: 959 bytes
Desc: ib_destroy.patch
URL: <http://lists.openfabrics.org/pipermail/ofw/attachments/20060910/be10b01a/attachment.obj>

