[Openib-windows] srp blue screen when CM fail to connect
Yossi Leybovich
sleybo at mellanox.co.il
Sun Sep 10 04:19:22 PDT 2006
Fab
We still get the blue screen on object destruction, even on a new
installation.
The problem is with ib_close_ca.
The CA object uses sync destruction (AL_OBJ_SUBTYPE_UM_EXPORT), but its
destruction still uses the ib_sync_destroy flag.
proxy_close_ca function:
/*
* Note that we hold a reference on the CA, so we need to
* call close_ca, not ib_close_ca. We also don't release the reference
* since close_ca will do so (by destroying the object).
*/
h_ca->obj.pfn_destroy( &h_ca->obj, ib_sync_destroy );
p_ioctl->out.status = IB_SUCCESS;
The code does not protect sync objects from being destroyed with the
ib_sync_destroy value, so we end up calling 0xffffffff.
We could fix the call in proxy_close_ca, but there are more places that
pass ib_sync_destroy.
I think the way to solve this is to also check for ib_sync_destroy in the
sync_destroy_obj function.
This patch fixes the problem:
Index: W:/work/clean/core/al/al_common.c
===================================================================
--- W:/work/clean/core/al/al_common.c (revision 1666)
+++ W:/work/clean/core/al/al_common.c (revision 1667)
@@ -467,7 +467,7 @@
AL_ENTER( AL_DBG_AL_OBJ );
if( pfn_destroy_cb == ib_sync_destroy )
- sync_destroy_obj( p_obj, __sync_destroy_cb );
+ sync_destroy_obj( p_obj, pfn_destroy_cb );
else if( destroy_obj( p_obj, pfn_destroy_cb ) )
deref_al_obj( p_obj ); /* Only destroy the object once. */
@@ -482,10 +482,12 @@
IN const ib_pfn_destroy_cb_t pfn_destroy_cb )
{
cl_status_t cl_status;
+ ib_pfn_destroy_cb_t destroy_cb = (pfn_destroy_cb == ib_sync_destroy) ? __sync_destroy_cb :pfn_destroy_cb;
AL_ENTER( AL_DBG_AL_OBJ );
- if( !destroy_obj( p_obj, pfn_destroy_cb ) )
+
+ if( !destroy_obj( p_obj, destroy_cb ) )
{
/* Object is already being destroyed... */
AL_EXIT( AL_DBG_AL_OBJ );
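To make the idea behind the patch concrete, here is a minimal standalone sketch of the same pattern: the sentinel value is mapped to a real, callable function at the entry point, so it can never reach the place where the callback is invoked. All names (demo_obj_t, DEMO_SYNC_DESTROY, demo_internal_sync_cb and so on) are hypothetical stand-ins and not the IBAL definitions, and the real object teardown is of course more involved than this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the IBAL types (assumption). */
typedef void (*demo_destroy_cb_t)(void *context);

/* Sentinel meaning "destroy synchronously"; it must never be called. */
#define DEMO_SYNC_DESTROY ((demo_destroy_cb_t)(intptr_t)-1)

typedef struct demo_obj {
    demo_destroy_cb_t user_destroy_cb; /* invoked once teardown completes */
    void *context;
    int cb_calls;                      /* how often a real callback ran */
    int freed;
} demo_obj_t;

/* Internal callback substituted for the sentinel (hypothetical). */
static void demo_internal_sync_cb(void *context)
{
    demo_obj_t *obj = (demo_obj_t *)context;
    obj->cb_calls++;
}

/* Map the sentinel to a callable function; pass anything else through. */
static demo_destroy_cb_t demo_normalize_cb(demo_destroy_cb_t cb)
{
    return (cb == DEMO_SYNC_DESTROY) ? demo_internal_sync_cb : cb;
}

static void demo_sync_destroy_obj(demo_obj_t *obj, demo_destroy_cb_t cb)
{
    /* Normalize once, before the callback is stored. */
    obj->user_destroy_cb = demo_normalize_cb(cb);

    /* The sentinel must never reach the call site. */
    assert(obj->user_destroy_cb != DEMO_SYNC_DESTROY);
    if (obj->user_destroy_cb != NULL)
        obj->user_destroy_cb(obj->context);

    obj->freed = 1;
}
```

With this shape, a caller can pass either DEMO_SYNC_DESTROY, NULL, or a real callback to demo_sync_destroy_obj, and none of the callers has to be changed individually.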
This is the blue screen call stack:
PAGE_FAULT_IN_NONPAGED_AREA (50)
Invalid system memory was referenced. This cannot be protected by try-except,
it must be protected by a Probe. Typically the address is just plain bad or it
is pointing at freed memory.
Arguments:
Arg1: ffffffff, memory referenced.
Arg2: 00000000, value 0 = read operation, 1 = write operation.
Arg3: ffffffff, If non-zero, the instruction address which referenced the bad memory address.
Arg4: 00000000, (reserved)
Debugging Details:
------------------
READ_ADDRESS: ffffffff
FAULTING_IP:
+ffffffffffffffff
ffffffff ?? ???
MM_INTERNAL_CODE: 0
DEFAULT_BUCKET_ID: DRIVER_FAULT
BUGCHECK_STR: 0x50
CURRENT_IRQL: 1
LAST_CONTROL_TRANSFER: from 8087a46f to 80833f96
FAILED_INSTRUCTION_ADDRESS:
+ffffffffffffffff
ffffffff ?? ???
STACK_TEXT:
b959e698 8087a46f 00000003 00000000 ffffffff
nt!RtlpBreakWithStatusInstruction
b959e6e4 8087b236 00000003 808b4120 88d848c8
nt!KiBugCheckDebugBreak+0x19
b959ea7c 8087b6be 00000050 ffffffff 00000000 nt!KeBugCheck2+0x5b2
b959ea9c 808689ee 00000050 ffffffff 00000000 nt!KeBugCheckEx+0x1b
b959eaec 80837d0a 00000000 ffffffff 00000000 nt!MmAccessFault+0x813
b959eaec ffffffff 00000000 ffffffff 00000000 nt!KiTrap0E+0xdc
WARNING: Frame IP not in any known module. Following frames may be
wrong.
b959eb74 b940b35a 00089318 88d86b2c 00000200 0xffffffff
b959eb8c b940b024 88d86b2c 89989d20 899a6cd8 ibbus!async_destroy_cb+0xda
[s:\builds\1660\trunk\core\al\al_common.c @ 675]
b959eba0 b9410b9e 88d86b18 ffffffff ffffffff ibbus!sync_destroy_obj+0xd6
[s:\builds\1660\trunk\core\al\al_common.c @ 548]
b959ebbc b94147de 8895b2c0 b959ebf4 8895b2c0 ibbus!proxy_close_ca+0xca
[s:\builds\1660\trunk\core\al\kernel\al_proxy_verbs.c @ 638]
b959ebd8 b93ee640 8895b2c0 b959ebf4 89cf32d8 ibbus!verbs_ioctl+0x24c
[s:\builds\1660\trunk\core\al\kernel\al_proxy_verbs.c @ 3451]
b959ebf8 b942411e 8895b2c0 80a78be4 89cf3220 ibbus!al_dev_ioctl+0xb2
[s:\builds\1660\trunk\core\al\kernel\al_dev.c @ 455]
b959ec10 809d457d 89cf3304 89cf32ec 8895b2c0 ibbus!bus_drv_ioctl+0x8e
[s:\builds\1660\trunk\core\bus\kernel\bus_driver.c @ 401]
b959ec40 80859657 8092d3b9 b959ec60 8092d3b9 nt!IovCallDriver+0x112
b959ec4c 8092d3b9 8895b354 8987f280 8895b2c0 nt!IofCallDriver+0x13
b959ec60 8092e81b 89cf3220 8895b2c0 8987f280
nt!IopSynchronousServiceTail+0x10b
b959ed00 80940844 00000788 00000000 00000000 nt!IopXxxControlFile+0x5db
b959ed34 80834d3f 00000788 00000000 00000000
nt!NtDeviceIoControlFile+0x2a
b959ed34 7c82ed54 00000788 00000000 00000000 nt!KiFastCallEntry+0xfc
0006f838 7c8213e4 77e416f1 00000788 00000000 ntdll!KiFastSystemCallRet
0006f83c 77e416f1 00000788 00000000 00000000
ntdll!NtDeviceIoControlFile+0xc
0006f8a0 00265210 00000788 003b0040 0006f9e4
kernel32!DeviceIoControl+0x137
0006f8d0 002676bd 003b0040 0006f9e4 00000008 ibal!do_al_dev_ioctl+0x57
[s:\builds\1660\trunk\core\al\user\al_dll.c @ 192]
0006f9ec 00268203 00089318 002518d8 00089318 ibal!ual_close_ca+0xd3
[s:\builds\1660\trunk\core\al\user\ual_ca.c @ 236]
0006fa04 00259ab0 00089318 0008932c 00089318 ibal!cleanup_ci_ca+0x4e
[s:\builds\1660\trunk\core\al\user\ual_ci_ca.c @ 328]
0006fa1c 00259780 0008932c 00084f48 00084fac ibal!async_destroy_cb+0x98
[s:\builds\1660\trunk\core\al\al_common.c @ 661]
0006fa30 002594a2 00089318 00000000 00000000 ibal!sync_destroy_obj+0xd7
[s:\builds\1660\trunk\core\al\al_common.c @ 548]
0006fa50 002596ea 00084f48 00000000 00000000 ibal!destroy_obj+0x150
[s:\builds\1660\trunk\core\al\al_common.c @ 615]
0006fa6c 0025b4a5 00084f48 00000000 00000000 ibal!sync_destroy_obj+0x41
[s:\builds\1660\trunk\core\al\al_common.c @ 488]
0006fa84 0026d51a 77bd27c2 0006fb1c 004026bb ibal!al_cleanup+0xa3
[s:\builds\1660\trunk\core\al\al_init.c @ 146]
0006fa90 004026bb 0008a9d8 00000000 00000000 ibal!ib_close_al+0x39
[s:\builds\1660\trunk\core\al\user\ual_mgr.c @ 1106]
0006fb1c 0100123b 0006fb38 00000400 ffffffff mtcr!mdevices+0xa1
[s:\builds\1660\trunk\tools\mft\user\mtcr\mtcr.c @ 473]
0006ff3c 010012c5 0006ffc0 0100157b 00000002 mst!list_devices+0x2b
[s:\builds\1660\trunk\tools\mft\user\mst\mst.c @ 18]
0006ff44 0100157b 00000002 002a24c0 002a2ac0 mst!main+0x28
[s:\builds\1660\trunk\tools\mft\user\mst\mst.c @ 41]
0006ffc0 77e523cd 00000000 00000000 7ffde000 mst!mainCRTStartup+0x12f
[d:\dnsrv\base\crts\crtw32\dllstuff\crtexe.c @ 501]
0006fff0 00000000 0100144c 00000000 78746341
kernel32!BaseProcessStart+0x23
STACK_COMMAND: kb
FOLLOWUP_IP:
ibbus!async_destroy_cb+da [s:\builds\1660\trunk\core\al\al_common.c @
675]
b940b35a f605243042b902 test byte ptr [ibbus!WPP_GLOBAL_Control+0x14
(b9423024)],0x2
FAULTING_SOURCE_CODE:
671: p_obj->user_destroy_cb( (void*)p_obj->context );
672: }
673:
674: /* Free the resources associated with the object. */
> 675: AL_PRINT( TRACE_LEVEL_INFORMATION, AL_DBG_AL_OBJ, ("freeing object\n" ) );
676: p_obj->pfn_free( p_obj );
677:
678: /* Dereference the parent after freeing the child. */
679: if( p_parent_obj )
680: deref_al_obj( p_parent_obj );
SYMBOL_STACK_INDEX: 7
FOLLOWUP_NAME: MachineOwner
SYMBOL_NAME: ibbus!async_destroy_cb+da
MODULE_NAME: ibbus
IMAGE_NAME: ibbus.sys
DEBUG_FLR_IMAGE_TIMESTAMP: 44ffe561
FAILURE_BUCKET_ID: 0x50_VRF_CODE_AV_BAD_IP_ibbus!async_destroy_cb+da
BUCKET_ID: 0x50_VRF_CODE_AV_BAD_IP_ibbus!async_destroy_cb+da
Followup: MachineOwner
________________________________
From: Tillier, Fabian [mailto:ftillier at silverstorm.com]
Sent: Tuesday, September 05, 2006 11:12 PM
To: Yossi Leybovich
Subject: RE: [Openib-windows] srp blue screen when CM fail to
connect
Hi Yossi,
I hadn't tried to compile it for x86, no. I have a fix for this
in my sandbox, though. Can you verify that the ib_sync_destroy changes
do work for you? The only way you would crash during destruction is if
the -1 pointer for ib_sync_destroy was used as an actual function
pointer. The patch I sent addressed the issue, so I suspect the IBAL
binary wasn't updated. Let me know and I'll check in a patch that
handles both x64 and x86.
- Fab
________________________________
From: Yossi Leybovich [mailto:sleybo at mellanox.co.il]
Sent: Monday, September 04, 2006 1:11 AM
To: Tillier, Fabian
Subject: RE: [Openib-windows] srp blue screen when CM fail to
connect
> -----Original Message-----
> From: ftillier.sst at gmail.com [mailto:ftillier.sst at gmail.com]
> On Behalf Of Fabian Tillier
> Sent: Saturday, September 02, 2006 2:04 AM
> To: Yossi Leybovich
> Cc: openib-windows at openib.org
> Subject: Re: [Openib-windows] srp blue screen when CM fail to
connect
>
> Hi Yossi,
>
> On 8/31/06, Yossi Leybovich <sleybo at mellanox.co.il> wrote:
> >
> > Fab
> > I got blue screen while trying to bring our SRP target up.
>
> I have a fix, but I don't quite understand why it makes the
> problem go away. I didn't find any double free issues in the
> code, but did take the opportunity to clean up the code a
little.
>
> Since I don't quite understand why it works now, I didn't
> check the changes in. Instead the patch is attached - please
> give it a shot and let me know if you see the previous bug.
>
Did you try to compile this for 32-bit?
I got 2 errors:
1.
2>errors in directory w:\work\latest\core\al\kernel
2>inc\iba\ib_al.h(436) : error C2220: warning treated as error -
no object file generated
2>inc\iba\ib_al.h(436) : error C4306: 'type cast' : conversion
from 'LONG_PTR' to 'ib_pfn_destroy_cb_t' of greater size
2.
100>errors in directory w:\work\latest\core\al\user
100>al_exports.def : error LNK2001: unresolved external symbol
ib_sync_destroy
100>wbin\user\objchk_wnet_x86\i386\ibald.lib : error LNK1120: 1
unresolved externals
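On the first error: the definition at ib_al.h line 436 is not shown in this thread, so the following is only a sketch of one common way to declare a sentinel callback value so that the integer and the pointer have the same width on both x86 and x64. It assumes a C11 compiler for _Static_assert, and sketch_cb_t, SKETCH_SYNC_DESTROY and sketch_is_sync_destroy are hypothetical names, not the IBAL definitions. It does not address the second (linker) error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback type standing in for ib_pfn_destroy_cb_t. */
typedef void (*sketch_cb_t)(void *context);

/* Fail the build if a function pointer and intptr_t differ in size,
 * since the cast below would then lose or invent bits. */
_Static_assert(sizeof(sketch_cb_t) == sizeof(intptr_t),
               "function pointer and intptr_t must have the same size");

/* Going through intptr_t keeps the integer as wide as the pointer on
 * both 32-bit and 64-bit targets. The value is a marker only and must
 * never be called. */
#define SKETCH_SYNC_DESTROY ((sketch_cb_t)(intptr_t)-1)

/* Returns nonzero if cb is the sentinel rather than a real function. */
static int sketch_is_sync_destroy(sketch_cb_t cb)
{
    return (intptr_t)cb == (intptr_t)-1;
}
```

Converting an integer to a function pointer is implementation-defined in C, so this relies on the compiler treating the value as an opaque bit pattern, which is the usual behavior on the platforms discussed here.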
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ib_destroy.patch
Type: application/octet-stream
Size: 959 bytes
Desc: ib_destroy.patch
URL: <http://lists.openfabrics.org/pipermail/ofw/attachments/20060910/be10b01a/attachment.obj>