[openib-general] [Bug 203] New: Crash on shutdown, timer callback, build 459

bugzilla-daemon at openib.org bugzilla-daemon at openib.org
Tue Aug 22 04:25:00 PDT 2006


http://openib.org/bugzilla/show_bug.cgi?id=203

           Summary: Crash on shutdown, timer callback, build 459
           Product: OpenFabrics Windows
           Version: unspecified
          Platform: X86
        OS/Version: Other
            Status: NEW
          Severity: major
          Priority: P2
         Component: Core
        AssignedTo: bugzilla at openib.org
        ReportedBy: jbottorff at xsigo.com


While trying to debug some of the shutdown hangs I see, I configured a pair of
32-bit W2k3 SP1 systems back to back with no switch. I then ran opensm on one
(free OS build, checked IB drivers), and had a script cycle reboots on the
other (checked OS build, Driver Verifier, checked IB drivers).

After just a few reboots, I had a very curious crash (which I had never seen
before), which seemed to be repeatable every few reboots. The crash would occur
in a timer callback when trying to dereference a garbage context value (it
always had the value 0x1).

I suspect what may be happening is that some IB object containing a cl_timer_t
is being deallocated while the timer is still active. The memory containing
the cl_timer_t is then overwritten (reallocated?) and the context value for the
timer callback dpc is destroyed.
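
To make the suspected lifetime problem concrete, here is a minimal user-mode
sketch of the teardown order that would avoid it: stop the embedded timer
first, free the parent second. Every name in it (model_timer_t,
model_parent_t and the model_* functions) is hypothetical and only models the
idea; none of it is taken from the driver or from complib.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical user-mode model; none of these names exist in the driver. */
typedef void (*model_cb_t)(void *context);

typedef struct model_timer {
    bool       armed;     /* true while the timer may still fire */
    model_cb_t callback;
    void      *context;   /* points back at the parent object */
} model_timer_t;

typedef struct model_parent {
    int           fired;  /* number of callbacks delivered so far */
    model_timer_t timer;  /* embedded, so it shares the parent's lifetime */
} model_parent_t;

static void model_on_timeout(void *context)
{
    ((model_parent_t *)context)->fired++;
}

model_parent_t *model_parent_create(void)
{
    model_parent_t *p = calloc(1, sizeof(*p));
    if (p != NULL) {
        p->timer.callback = model_on_timeout;
        p->timer.context  = p;
    }
    return p;
}

void model_timer_start(model_timer_t *t)
{
    t->armed = true;
}

/* Disarm the timer. Returns true if it was still pending. */
bool model_timer_stop(model_timer_t *t)
{
    bool was_armed = t->armed;
    t->armed = false;
    return was_armed;
}

/* Stand-in for the timer expiring. Returns true if the callback ran. */
bool model_timer_expire(model_timer_t *t)
{
    if (!t->armed)
        return false;
    t->armed = false;
    t->callback(t->context);
    return true;
}

/* Teardown in the safe order: stop the timer, then free the memory.
 * Returns true if a timer was still pending at destroy time. */
bool model_parent_destroy(model_parent_t *p)
{
    bool was_pending = model_timer_stop(&p->timer);
    free(p);
    return was_pending;
}
```

If model_parent_destroy() skipped the stop step, a later expiry would run the
callback against freed memory, which is the failure suspected here. A real
kernel teardown would additionally have to wait for a callback that is already
running; this model does not cover that case.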

So now all the details:

Here is the initial crash analysis:

0: kd> !analyze -v
DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high.  This is usually
caused by drivers using improper addresses.
If kernel debugger is available get stack backtrace.
Arguments:
Arg1: 00000051, memory referenced
Arg2: 00000002, IRQL
Arg3: 00000001, value 0 = read operation, 1 = write operation
Arg4: baa4b1c8, address which referenced memory

Debugging Details:
------------------


OVERLAPPED_MODULE: Address regions for 'Fips' and 'imapi.sys' overlap

WRITE_ADDRESS:  00000051 

CURRENT_IRQL:  2

FAULTING_IP: 
ibbus!__timer_callback+8
[k:\windows-openib\src\winib-459\core\complib\kernel\cl_timer.c @ 48]
baa4b1c8 c7405000000000   mov     dword ptr [eax+0x50],0x0

DEFAULT_BUCKET_ID:  DRIVER_FAULT

BUGCHECK_STR:  0xD1

LAST_CONTROL_TRANSFER:  from 8063717b to 8075cc0c

STACK_TEXT:  
f78a2a5c 8063717b 00000003 00000000 0000000a nt!RtlpBreakWithStatusInstruction
f78a2aa8 806380d8 00000003 00000051 baa4b1c8 nt!KiBugCheckDebugBreak+0x19
f78a2e40 8077f6ef 0000000a 00000051 00000002 nt!KeBugCheck2+0x5b2
f78a2e40 baa4b1c8 0000000a 00000051 00000002 nt!KiTrap0E+0x2af
f78a2ed4 8064858a 88e76fc0 88e76e40 f1e309be ibbus!__timer_callback+0x8
[k:\windows-openib\src\winib-459\core\complib\kernel\cl_timer.c @ 48]
f78a2f9c 80648a46 00000000 00000000 025ed741 nt!KiTimerExpiration+0x660
f78a2ff4 80780d8f f78da208 00000000 00000000 nt!KiRetireDpcList+0x62
f78a2ff8 f78da208 00000000 00000000 00000000 nt!KiDispatchInterrupt+0x3f
WARNING: Frame IP not in any known module. Following frames may be wrong.
80780d8f 00000000 0000000a bb837775 00000128 0xf78da208


STACK_COMMAND:  .bugcheck ; kb

FOLLOWUP_IP: 
ibbus!__timer_callback+8
[k:\windows-openib\src\winib-459\core\complib\kernel\cl_timer.c @ 48]
baa4b1c8 c7405000000000   mov     dword ptr [eax+0x50],0x0

FAULTING_SOURCE_CODE:  
    44:         UNUSED_PARAM( p_dpc );
    45:         UNUSED_PARAM( arg1 );
    46:         UNUSED_PARAM( arg2 );
    47: 
>   48: 	p_timer->timeout_time = 0;
    49: 
    50:         (p_timer->pfn_callback)( (void*)p_timer->context );
    51: }
    52: 
    53: 


SYMBOL_STACK_INDEX:  4

FOLLOWUP_NAME:  MachineOwner

SYMBOL_NAME:  ibbus!__timer_callback+8

MODULE_NAME:  ibbus

IMAGE_NAME:  ibbus.sys

DEBUG_FLR_IMAGE_TIMESTAMP:  44e65c89

FAILURE_BUCKET_ID:  0xD1_W_VRF_ibbus!__timer_callback+8

BUCKET_ID:  0xD1_W_VRF_ibbus!__timer_callback+8

Followup: MachineOwner
---------

The direct cause of the crash is that the local variable p_timer has a value of
0x1.
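
As a quick cross-check of that claim, the faulting instruction writes to
[eax+0x50], so a p_timer of 0x1 should produce the write address reported in
Arg1. A small sketch of the arithmetic (the function name fault_address is
made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Displacement used by the faulting instruction:
 *   mov dword ptr [eax+0x50],0x0
 * This is the offset of timeout_time within the timer structure. */
#define TIMEOUT_TIME_OFFSET 0x50u

/* Address written by "p_timer->timeout_time = 0" on a 32-bit system. */
uint32_t fault_address(uint32_t p_timer)
{
    /* Guard against wrap-around of the 32-bit address. */
    assert(p_timer <= UINT32_MAX - TIMEOUT_TIME_OFFSET);
    return p_timer + TIMEOUT_TIME_OFFSET;
}
```

fault_address(0x1) gives 0x51, which agrees with both Arg1 and WRITE_ADDRESS
in the analysis output.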

A dump of the dpc object confirms the invalid context value:
0: kd> dt p_dpc
Local var @ 0xf78a2edc Type _KDPC*
0x88e76fc0 
   +0x000 Type             : 0x13 ''
   +0x001 Importance       : 0x1 ''
   +0x002 Number           : 0 ''
   +0x003 Expedite         : 0 ''
   +0x004 DpcListEntry     : _LIST_ENTRY [ 0x0 - 0x0 ]
   +0x00c DeferredRoutine  : 0xbaa4b1c0     ibbus!__timer_callback+0
   +0x010 DeferredContext  : 0x00000001 
   +0x014 SystemArgument1  : (null) 
   +0x018 SystemArgument2  : (null) 
   +0x01c DpcData          : (null) 


A little digging suggests that the dpc object is embedded in the structure
p_timer should point at. A pool dump of the dpc object's address should
therefore tell us which allocation contains it, and it does:

0: kd> !pool 0x88e76fc0
Pool page 88e76fc0 region is Special pool
*88e76e38 size:  1c8 non-paged special pool, Tag is Ddk 
                Pooltag Ddk  : Default for driver allocated memory (user's of ntddk.h)

Since we know the dpc address, and we know the offset of the dpc inside the
cl_timer_t, we can calculate what p_timer should have been and get a
structured dump:

0: kd> dt p_timer
Local var @ 0xf78a2ee0 Type _cl_timer*
0x88e76f98 
   +0x000 timer            : _KTIMER
   +0x028 dpc              : _KDPC
   +0x048 pfn_callback     : 0xba9dba60     ibbus!__recv_timer_cb+0
   +0x04c context          : 0x88e76e38 
   +0x050 timeout_time     : 0x42cbcd5
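
The calculation described above is ordinary container-of arithmetic, the same
thing the WDK CONTAINING_RECORD macro does. A sketch using a mock layout:
mock_cl_timer_t and timer_from_dpc are hypothetical names, only the offsets
are taken from the dt output, and the member types are placeholders rather
than the real definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mock of the 32-bit layout printed by "dt p_timer". Only the offsets
 * come from the dump; the member types here are placeholders. */
typedef struct mock_cl_timer {
    uint8_t  timer[0x28];   /* +0x000 _KTIMER */
    uint8_t  dpc[0x20];     /* +0x028 _KDPC */
    uint32_t pfn_callback;  /* +0x048 */
    uint32_t context;       /* +0x04c */
    uint32_t timeout_time;  /* +0x050 (real width is not shown in the dump) */
} mock_cl_timer_t;

/* Recover the address of the containing timer from the address of its
 * embedded dpc member. */
uint32_t timer_from_dpc(uint32_t dpc_address)
{
    uint32_t dpc_offset = (uint32_t)offsetof(mock_cl_timer_t, dpc);

    assert(dpc_address >= dpc_offset);
    return dpc_address - dpc_offset;
}
```

With the dpc at 0x88e76fc0 this yields 0x88e76f98, the address shown in the
dump.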

Since we also know from !pool where the allocation started, and where the
cl_timer_t is, we can calculate its offset as 0x160 (+/- some pool header). The
context value also now matches what !pool reported as the allocation start.
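
The same numbers can be checked mechanically. The helper names below are made
up for illustration, and the 0x54 timer size used in the example is only a
lower bound inferred from the last member offset (0x50) plus a 4-byte write,
not the real sizeof(cl_timer_t).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Offset of a member address from the start of its pool allocation. */
uint32_t offset_in_allocation(uint32_t alloc_start, uint32_t member_address)
{
    assert(member_address >= alloc_start);
    return member_address - alloc_start;
}

/* First address past the end of the allocation. */
uint32_t allocation_end(uint32_t alloc_start, uint32_t alloc_size)
{
    return alloc_start + alloc_size;
}

/* True if [member_address, member_address + member_size) lies entirely
 * inside the allocation. */
bool member_fits(uint32_t alloc_start, uint32_t alloc_size,
                 uint32_t member_address, uint32_t member_size)
{
    if (member_address < alloc_start)
        return false;
    return (member_address - alloc_start) + member_size <= alloc_size;
}
```

For the values in this dump, offset_in_allocation(0x88e76e38, 0x88e76f98) is
0x160 and the timer fits inside the 0x1c8-byte block. The block ends at
0x88e77000, a page boundary, which is what I would expect from a special pool
allocation.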

Offhand, an al_mad_svc_t looks like a potential candidate for the parent
object, although I don't know if those really have a size of 0x1c8. This
suggests that a timer was not canceled when a mad object was destroyed, or
maybe the mad was waiting for a reply when it was destroyed and didn't get
cleaned up correctly.

I have a crash dump written to a file, and matching sources (459 rev) and
symbols, if we need to dig some more. This crash also seems very reproducible
at the moment.






