[ofw] RE: bugcheck in mlx4_bus

Fab Tillier ftillier at microsoft.com
Thu Aug 20 11:15:52 PDT 2009


Is this running over IBAL or over WinVerbs?

Today, IBAL is responsible for tracking all memory registrations, and freeing them when the process exits.  I assume WinVerbs does the same, though maybe not?

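The place to trap the process exiting is in IRP_MJ_CLEANUP, not IRP_MJ_CLOSE.  Cleanup is sent when the last handle to the file object is closed, which happens while the process is being torn down.  Close is only sent once the last reference to the file object goes away, which can be later.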
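
As a rough illustration of the cleanup-versus-close split, here is a small user-mode C model, not driver code.  It assumes the driver keeps a per-handle list of outstanding registrations.  All names in it (file_context, registration, track_registration, on_cleanup, on_close) are hypothetical and are not taken from IBAL, WinVerbs or mlx4_bus.  In a real driver the release step would unlock and free each MDL (MmUnlockPages, IoFreeMdl) rather than call free().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for one locked memory registration (an MDL in a driver). */
struct registration {
    struct registration *next;
    size_t locked_pages;
};

/* Hypothetical per-handle context, e.g. something a driver hangs off its file object. */
struct file_context {
    struct registration *regs;  /* registrations not yet released */
    size_t locked_pages;        /* pages still locked on behalf of this handle */
};

/* Record a new registration so that cleanup can find it later. Returns 0 on success. */
int track_registration(struct file_context *ctx, size_t pages)
{
    struct registration *reg = malloc(sizeof(*reg));
    if (reg == NULL)
        return -1;
    reg->locked_pages = pages;
    reg->next = ctx->regs;
    ctx->regs = reg;
    ctx->locked_pages += pages;
    return 0;
}

/* Model of the IRP_MJ_CLEANUP handler: release everything the handle still holds.
 * Returns the number of registrations released. */
size_t on_cleanup(struct file_context *ctx)
{
    size_t released = 0;
    while (ctx->regs != NULL) {
        struct registration *reg = ctx->regs;
        ctx->regs = reg->next;
        assert(ctx->locked_pages >= reg->locked_pages);
        ctx->locked_pages -= reg->locked_pages;
        free(reg);  /* a driver would unlock the pages and free the MDL here */
        released++;
    }
    return released;
}

/* Model of the IRP_MJ_CLOSE handler: by this point nothing should remain locked.
 * Returns 1 if the context is clean, 0 if registrations were leaked. */
int on_close(const struct file_context *ctx)
{
    return ctx->regs == NULL && ctx->locked_pages == 0;
}
```

In this model, a handle that registered memory and reaches on_close() without on_cleanup() having run reports a leak, which is the user-mode analogue of pages still being locked when the process address space is cleaned up.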

-Fab

> -----Original Message-----
> From: ofw-bounces at lists.openfabrics.org [mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Hefty, Sean
> Sent: Thursday, August 20, 2009 10:59 AM
> To: ofw at lists.openfabrics.org
> Subject: [ofw] bugcheck in mlx4_bus
>
> I hit a bugcheck yesterday while running Intel MPI PingPong tests on a
> single node, scaling up the number of ranks from 2 to 64.  The system
> is running Server 2003.  A bugcheck analysis suggested adding the
> following registry value:
>
> HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management\TrackLockedPages
>
> DWORD with a value of 1
>
> This produced the bugcheck below while re-running the MPI PingPong
> tests.  I'm running checked drivers with free versions of the
> libraries.  It's possible this is pointing to a cleanup issue higher in
> the stack.  I'm trying to find more details.
>
> *******************************************************************************
> *                                                                             *
> *                        Bugcheck Analysis                                    *
> *                                                                             *
> *******************************************************************************
>
> DRIVER_LEFT_LOCKED_PAGES_IN_PROCESS (cb)
> Caused by a driver not cleaning up completely after an I/O.
> When possible, the guilty driver's name (Unicode string) is printed on
> the bugcheck screen and saved in KiBugCheckDriver.
> Arguments:
> Arg1: fffffadf8e0ae4f0, The calling address in the driver that locked the pages or if the
>         IO manager locked the pages this points to the dispatch routine
>         of the top driver on the stack to which the IRP was sent.
> Arg2: 0000000000000000, The caller of the calling address in the driver that locked the
>         pages. If the IO manager locked the pages this points to the
>         device object of the top driver on the stack to which the IRP was sent.
> Arg3: fffffadf980c6580, A pointer to the MDL containing the locked pages.
> Arg4: 0000000000000021, The number of locked pages.
>
> Debugging Details:
> ------------------
>
> PEB is paged out (Peb.Ldr = 000007ff`fffda018).  Type ".hh dbgerr001" for details
> PEB is paged out (Peb.Ldr = 000007ff`fffda018).  Type ".hh dbgerr001" for details
>
> FAULTING_IP:
> mlx4_bus!register_segment+100 [c:\mshefty\scm\winof\branches\winverbs\hw\mlx4\kernel\bus\core\iobuf.c @ 197]
> fffffadf`8e0ae4f0 eb7d            jmp     mlx4_bus!register_segment+0x17f (fffffadf`8e0ae56f)
>
> DEFAULT_BUCKET_ID:  DRIVER_FAULT
>
> BUGCHECK_STR:  0xCB
>
> PROCESS_NAME:  IMB-MPI1.exe
>
> CURRENT_IRQL:  f
>
> LAST_CONTROL_TRANSFER:  from fffff8000107984c to fffff80001026cf0
>
> STACK_TEXT:
> fffffadf`8e16ee28 fffff800`0107984c : 0000fadf`8ee3aa62 00000000`00004cb6 00000000`00000000 00000000`00000000 : nt!RtlpBreakWithStatusInstruction
> fffffadf`8e16ee30 fffff800`010c514e : 00000000`04d18000 00000000`dffe0000 00000000`04d18000 fffffadf`9aad51b0 : nt!KdCheckForDebugBreak+0xb5
> fffffadf`8e16ee70 fffff800`010d89bb : fffffadf`8e0ae400 00000000`00000000 00000000`00000000 00000000`000000cb : nt!IoWriteCrashDump+0x851
> fffffadf`8e16f030 fffff800`0102e994 : fffff6fb`c0000000 fffff6fb`c0000000 fffffadf`988ba440 fffffadf`9b6b9340 : nt!KeBugCheck2+0xb83
> fffffadf`8e16f670 fffff800`01096f23 : 00000000`000000cb fffffadf`8e0ae4f0 00000000`00000000 fffffadf`980c6580 : nt!KeBugCheckEx+0x104
> fffffadf`8e16f6b0 fffff800`0127381a : fffffa80`01e7b960 fffffadf`8e16fc70 00000000`00000000 fffffadf`988ba440 : nt!MmCleanProcessAddressSpace+0x904
> fffffadf`8e16f720 fffff800`0127bb72 : fffffadf`0000007b 00000000`0000007b fffffadf`988ba488 00000000`00000000 : nt!PspExitThread+0xb4d
> fffffadf`8e16f9b0 fffff800`01038c30 : 00000000`00000000 fffffadf`8e16fcf0 00000520`657cb7f8 00000000`00000002 : nt!PsExitSpecialApc+0x1d
> fffffadf`8e16f9e0 fffff800`01027c3b : 00000000`00000000 fffffadf`8e16fa80 fffff800`0127bdc0 00000000`00000000 : nt!KiDeliverApc+0x504
> fffffadf`8e16fa80 fffff800`0102e3f2 : fffffadf`8e16fc18 00000000`00000000 00000000`00000001 fffffadf`9b8c6540 : nt!KiInitiateUserApc+0x7b
> fffffadf`8e16fc00 00000000`77ef0a6a : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceExit+0xad
> 00000000`0012f3a8 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x77ef0a6a
>
>
> STACK_COMMAND:  .bugcheck ; kb
>
> FOLLOWUP_IP:
> mlx4_bus!register_segment+100 [c:\mshefty\scm\winof\branches\winverbs\hw\mlx4\kernel\bus\core\iobuf.c @ 197]
> fffffadf`8e0ae4f0 eb7d            jmp     mlx4_bus!register_segment+0x17f (fffffadf`8e0ae56f)
>
> FAULTING_SOURCE_CODE:
>    193:         }
>    194:
>    195:         __try { /* try */
>    196:                 MmProbeAndLockPages( mdl_p, mode, Operation );   /* lock memory */
> >  197:         } /* try */
>    198:
>    199:         __except (EXCEPTION_EXECUTE_HANDLER)    {
>    200:                 MLX4_PRINT(TRACE_LEVEL_ERROR, MLX4_DBG_MEMORY,
>    201:                     ("MOSAL_iobuf_register: Exception 0x%x on MmProbeAndLockPages(), va %I64d, sz %I64d\n",
>    202:                     GetExceptionCode(), va, size));
>
>
> SYMBOL_NAME:  mlx4_bus!register_segment+100
>
> FOLLOWUP_NAME:  MachineOwner
>
> MODULE_NAME: mlx4_bus
>
> IMAGE_NAME:  mlx4_bus.sys
>
> DEBUG_FLR_IMAGE_TIMESTAMP:  4a8d77d7
>
> FAILURE_BUCKET_ID:  X64_0xCB_mlx4_bus!register_segment+100
>
> BUCKET_ID:  X64_0xCB_mlx4_bus!register_segment+100
>
> Followup: MachineOwner
> ---------
>
> _______________________________________________
> ofw mailing list
> ofw at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw


