[ofw] Opensm & WinMad: a race, causing BSOD722

Fab Tillier ftillier at microsoft.com
Tue Jan 17 09:12:24 PST 2012


You're telling me that you don't have access to the script that killed OpenSM when it crashed?  Really?

Code review might be the only path to fixing this, but claiming you have no idea how it happened is false.  You own the scripts that were running at the time of the crash.  What were they trying to do?  Last we knew, you were starting and killing OpenSM, but now it seems you may have been doing that via shutdown, and not directly?  Which is it?  How can you not have any idea of the steps that led to this?

From: Uri Habusha [mailto:urih at mellanox.com]
Sent: Tuesday, January 17, 2012 9:07 AM
To: Fab Tillier; Smith, Stan; Leonid Keller; Hefty, Sean
Cc: ofw_list; Irena Gannon
Subject: RE: Opensm & WinMad: a race, causing BSOD722

No Fab you miss the point! I don't hate winmad and winverbs, I only love and hate people I don't have any feeling to SW component :)

For the point, we really don't have more information regarding the scenario that caused the BSOD. We don't do any target testing of SM. It can be due to shutdown, hibernate, driver restart, link disconnect/connect test and more.

How do you suggest to continue from this point. We can give up and ignore it. We can code review the code looks what can cause it. Add instrumental code or debug information trying to approve the assumption or collect more information for next time.
I think this is constructive approach...

From: Fab Tillier [mailto:ftillier at microsoft.com]<mailto:[mailto:ftillier at microsoft.com]>
Sent: Tuesday, January 17, 2012 6:58 PM
To: Uri Habusha; Smith, Stan; Leonid Keller; Hefty, Sean
Cc: ofw_list; Irena Gannon
Subject: RE: Opensm & WinMad: a race, causing BSOD722

No, Uri, you're missing the point.  How did you stop OpenSM?  Was it immediately after it was started?  Did you let it run for a fixed amount of time?  The fact that it ran in your automated regression test environment is worthless information - it is not actionable.  You need to describe *what* your automated regression tests *do* allows someone to try to duplicate the problem, even if it's one in 10000.

I can't believe that we have to argue about having repro steps included in bug reports.  This is ridiculous.  If you have a bug, tell whoever you are assigning the bug to how you ran into it.  It really is not that much to ask for.

Look, we all know you hate WinMad and WinVerbs, but please try to be a little more constructive in your interaction with Sean, Stan, and the community.  The animosity is getting tiresome.  I'd say the same thing to Sean and Stan, and that was the main reason I responded - to let Leo know that the details in the bug report were good, unlike what Stan had suggested.  However, I still stand by my statement that leaving repro steps out makes the bug report near worthless.

-Fab

From: Uri Habusha [mailto:urih at mellanox.com]<mailto:[mailto:urih at mellanox.com]>
Sent: Monday, January 16, 2012 9:16 PM
To: Fab Tillier; Smith, Stan; Leonid Keller; Hefty, Sean
Cc: ofw_list; Irena Gannon
Subject: RE: Opensm & WinMad: a race, causing BSOD722

Again you miss the point.

Not always we have a clear repro steps. If we have it the life was simple.
This issue happened during night run of our regression. The same test runs every night and this is the first time it happens, what means it is race/timing issue.

In such a case we need to review the code start thinking what can cause the issue and add instrumental code so next time it happens we understand better the scenario.

Uri

From: Fab Tillier [mailto:ftillier at microsoft.com]<mailto:[mailto:ftillier at microsoft.com]>
Sent: Tuesday, January 17, 2012 2:47 AM
To: Smith, Stan; Leonid Keller; Hefty, Sean
Cc: Uri Habusha; ofw_list; Irena Gannon
Subject: RE: Opensm & WinMad: a race, causing BSOD722

10D is the bugcheck code, WDF_VIOLATION.  The next 4 parameters are the bugcheck arguments.

Leonid, there's good detail in this report, but you're missing a critical 'repro steps' section that would allow this to be reproduced and debugged by the bug owner.  Just the crashdump information is only usually enough for simple bugs, and your analysis is probably on the right track.  Still, if you want someone else to look into it, it's best to tell them how they too can see it - debugging things live is more productive than debugging by code inspection.  How are Sean or Stan supposed to test that they have indeed fixed a bug, assuming they find something suspect and make code changes, when they have no idea how you triggered it?  Is it a consistent repro?  1 in a 100?

Thanks,
-Fab

From: ofw-bounces at lists.openfabrics.org<mailto:ofw-bounces at lists.openfabrics.org> [mailto:ofw-bounces at lists.openfabrics.org]<mailto:[mailto:ofw-bounces at lists.openfabrics.org]> On Behalf Of Smith, Stan
Sent: Monday, January 16, 2012 4:37 PM
To: Leonid Keller; Hefty, Sean
Cc: Uri Habusha; ofw_list; Irena Gannon
Subject: Re: [ofw] Opensm & WinMad: a race, causing BSOD722

What are the conditions required to generate the BSOD?
>From the terse description it appears you are shutting down OpenSM. How is this accomplished?

Opensm - 10D, {b, 76157f00, 0, 8811d008}.           Not clear what this statement implies? Please explain.

Stan.


From: Leonid Keller [mailto:leonid at mellanox.com]<mailto:[mailto:leonid at mellanox.com]>
Sent: Sunday, January 15, 2012 3:09 AM
To: Smith, Stan; Hefty, Sean
Cc: Uri Habusha; Irena Gannon; ofw_list
Subject: Opensm & WinMad: a race, cauing BSOD722

Hi guys,

We got a BSOD in Opensm - 10D, {b, 76157f00, 0, 8811d008}.
Could you take a look ?

My analysis:

Seems like BSOD has been caused by a  race between the main and MAD reading threads of Opensm.

The main thread has already closed the port and is now found in osm_subn_destroy():

opensm_main
                ...
                osm_mad_pool_destroy(&p_osm->mad_pool);
                osm_vendor_delete(&p_osm->p_vendor);                        // port release
                osm_subn_destroy(&p_osm->subn);                                    // the thread is found here now

The reading thread is still in action:

opensm!umad_receiver
                libibumad!umad_recv
                ...
                winmad!WmIoRead
                                winmad!WmProviderRead
WdfObjectAcquireLock(pProvider->ReadQueue);            // BSOD

A try to ReadQueue with !wdfqueue fails.
Seems like pProvider is already released. But there is no any checks of its validity in WmProviderRead().

Possible solution:
Maybe WmIoRead() should check, that the Provider is not being released and take some reference, while WmProviderRemoveHandler() should wait to this reference to be removed ?


Technical info:

// Bugcheck
WDF_VIOLATION (10d)
The Kernel-Mode Driver Framework has detected that Windows detected an error
in a framework-based driver. In general, the dump file will yield additional
information about the driver that caused this bug check.
Arguments:
Arg1: 0000000b, An attempt to acquire or release a lock was invalid. In this
                case, Parameter 3 further specifies the error that has been
                made.
Arg2: 76157f00, The handle value.
Arg3: 00000000, A handle passed to either WdfObjectAcquireLock or
                WdfObjectReleaseLock represents an object that does not
                support synchronization locks.
Arg4: 8811d008, Reserved.

// crash line
void WmProviderRead(WM_PROVIDER *pProvider, WDFREQUEST Request)
{
                WM_REGISTRATION                      *reg;
                NTSTATUS                                           status;
                WM_IO_MAD                                   *wmad;
                size_t                                                    outlen, len = 0;

                status = WdfRequestRetrieveOutputBuffer(Request, sizeof(WM_IO_MAD), &wmad, &outlen);
                if (!NT_SUCCESS(status)) {
                                goto out;
                }

                WdfObjectAcquireLock(pProvider->ReadQueue);
                if (pProvider->MadHead == NULL) {

// Local parameters
3: kd> ?? pProvider
struct _WM_PROVIDER * 0x894761c0
   +0x000 Entry            : _LIST_ENTRY [ 0x8e3ea070 - 0x8e3ea070 ]
   +0x008 RegIndex         : _INDEX_LIST
   +0x014 ReadQueue        : 0x76157f00 WDFQUEUE__
   +0x018 MadHead          : (null)
   +0x01c MadTail          : 0xbc67ccc8 _ib_mad_element
   +0x020 Lock             : _KGUARDED_MUTEX
   +0x040 Ref              : 0
   +0x044 Event            : _KEVENT
   +0x054 Pending          : 0
   +0x058 Active           : 0
   +0x05c SharedEvent      : _KEVENT
   +0x06c Exclusive        : 0
   +0x070 ExclusiveEvent   : _KEVENT

// ReadQueue
3: kd> !wdfqueue 0x76157f00

Dumping WDFQUEUE 0x76157f00
=========================
Manualcouldn't read 00000044
, Deleted, Disposing, Not power-managed, PowerOn, Cannot accept, Can dispatch, Dispatching, ExecutionLevelDispatch, SynchronizationScopeNone
    Number of driver owned requests: 0
    Number of waiting requests: 0


// opensm

3: kd> !PROCESS a14ba6c0
PROCESS a14ba6c0  SessionId: 1  Cid: 13fc    Peb: 7ffd9000  ParentCid: 10fc
    DirBase: bd5feee0  ObjectTable: bf7fa7b8  HandleCount:  41.
    Image: opensm.exe
    VadRoot 89c8ca38 Vads 51 Clone 0 Private 271. Modified 3. Locked 0.
    DeviceMap 81f41008
    Token                             a6e81030
    ElapsedTime                       00:00:00.890
    UserTime                          00:00:00.015
    KernelTime                        00:00:00.062
    QuotaPoolUsage[PagedPool]         0
    QuotaPoolUsage[NonPagedPool]      0
    Working Set Sizes (now,min,max)  (913, 50, 345) (3652KB, 200KB, 1380KB)
    PeakWorkingSetSize                930
    VirtualSize                       76 Mb
    PeakVirtualSize                   78 Mb
    PageFaultCount                    1047
    MemoryPriority                    BACKGROUND
    BasePriority                      8
    CommitCharge                      8768
    Job                               89e04368

        THREAD 89e05d48  Cid 13fc.1758  Teb: 7ffdf000 Win32Thread: fe333b18 RUNNING on processor 2
        Not impersonating
        DeviceMap                 81f41008
        Owning Process            0       Image:         <Unknown>
        Attached Process          a14ba6c0       Image:         opensm.exe
        Wait Start TickCount      5900630        Ticks: 1 (0:00:00:00.015)
        Context Switch Count      454
        UserTime                  00:00:00.015
        KernelTime                00:00:00.578
        Win32 Start Address opensm!mainCRTStartup (0x0042aaf5)
        Stack Init adf03fd0 Current adf038d0 Base adf04000 Limit adf01000 Call 0
        Priority 11 BasePriority 8 PriorityDecrement 0 IoPriority 2 PagePriority 5
        ChildEBP RetAddr
        adf03cec 826be3a4 nt!KdPollBreakIn+0xea
        adf03cf0 826be381 nt!KdCheckForDebugBreak+0x17 (FPO: [0,0,0])
        adf03d24 8261f430 nt!KeUpdateRunTime+0x164
        adf03d24 003fd34a hal!HalpClockInterruptPn+0x158 (FPO: [0,2] TrapFrame @ adf03d34)
        000b2918 003d6174 opensm!osm_subn_destroy+0x1da (FPO: [Non-Fpo]) (CONV: stdcall) [a:\builds\9565\branches\mlnx_winof-3_0_0\ulp\opensm\user\opensm\osm_subnet.c @ 501]
        000b2924 003ba03e opensm!osm_opensm_destroy+0x104 (FPO: [Non-Fpo]) (CONV: stdcall) [a:\builds\9565\branches\mlnx_winof-3_0_0\ulp\opensm\user\opensm\osm_opensm.c @ 313]
        000cfda8 003bab21 opensm!opensm_main+0x1c3e (FPO: [Non-Fpo]) (CONV: stdcall) [a:\builds\9565\branches\mlnx_winof-3_0_0\ulp\opensm\user\opensm\main.c @ 1264]
        000cfdf4 0042a9c4 opensm!main+0x191 (FPO: [Non-Fpo]) (CONV: cdecl) [a:\builds\9565\branches\mlnx_winof-3_0_0\ulp\opensm\user\opensm\main.c @ 1305]
        000cfe38 771c1194 opensm!__mainCRTStartup+0x102 (FPO: [Non-Fpo]) (CONV: cdecl) [d:\5359\minkernel\crts\crtw32\dllstuff\crtexe.c @ 695]
        000cfe44 7753b3f5 kernel32!BaseThreadInitThunk+0xe (FPO: [Non-Fpo])
        000cfe84 7753b3c8 ntdll!__RtlUserThreadStart+0x70 (FPO: [Non-Fpo])
        000cfe9c 00000000 ntdll!_RtlUserThreadStart+0x1b (FPO: [Non-Fpo])

        THREAD c8974030  Cid 13fc.16f4  Teb: 7ffd6000 Win32Thread: 00000000 RUNNING on processor 3
        IRP List:
            89e505b0: (0006,0094) Flags: 00060970  Mdl: 00000000
        Not impersonating
        DeviceMap                 81f41008
        Owning Process            0       Image:         <Unknown>
        Attached Process          a14ba6c0       Image:         opensm.exe
        Wait Start TickCount      5900629        Ticks: 2 (0:00:00:00.031)
        Context Switch Count      24
        UserTime                  00:00:00.000
        KernelTime                00:00:00.015
        Win32 Start Address complib!cl_thread_callback (0x72f409f0)
        Stack Init adf57fd0 Current adf57b44 Base adf58000 Limit adf55000 Call 0
        Priority 10 BasePriority 8 PriorityDecrement 2 IoPriority 2 PagePriority 5
        ChildEBP RetAddr
        adf576d4 8272fe71 nt!RtlpBreakWithStatusInstruction (FPO: [1,0,0])
        adf57724 8273096d nt!KiBugCheckDebugBreak+0x1c
        adf57ae8 8272fd10 nt!KeBugCheck2+0x68b
        adf57b08 8d956f9e nt!KeBugCheckEx+0x1e
        adf57b24 8d955829 Wdf01000!FxVerifierBugCheck+0x24 (FPO: [Non-Fpo])
        adf57b4c 8e3e6626 Wdf01000!imp_WdfObjectAcquireLock+0x26 (FPO: [Non-Fpo])
        adf57b5c 8e3e6a3c winmad!WdfObjectAcquireLock+0x16 (FPO: [Non-Fpo]) (CONV: stdcall) [c:\winddk\7600.16385.1\inc\wdf\kmdf\1.9\wdfsync.h @ 61]
        adf57b7c 8e3e5cdd winmad!WmProviderRead+0x3c (FPO: [Non-Fpo]) (CONV: stdcall) [a:\builds\9565\branches\mlnx_winof-3_0_0\core\winmad\kernel\wm_provider.c @ 278]
        adf57b94 8d94502a winmad!WmIoRead+0x2d (FPO: [Non-Fpo]) (CONV: stdcall) [a:\builds\9565\branches\mlnx_winof-3_0_0\core\winmad\kernel\wm_driver.c @ 128]
        adf57bb0 8d946256 Wdf01000!FxIoQueueIoRead::Invoke+0x2a (FPO: [Non-Fpo])
        adf57bd8 8d9489ac Wdf01000!FxIoQueue::DispatchRequestToDriver+0x1a3 (FPO: [Non-Fpo])
        adf57bf4 8d949a36 Wdf01000!FxIoQueue::DispatchEvents+0x3be (FPO: [Non-Fpo])
        adf57c14 8d94b824 Wdf01000!FxIoQueue::QueueRequest+0x1ec (FPO: [Non-Fpo])
        adf57c38 8d93aa3f Wdf01000!FxPkgIo::Dispatch+0x27d (FPO: [Non-Fpo])
        adf57c44 8268f4bc Wdf01000!FxDevice::Dispatch+0x7f (FPO: [Non-Fpo])
        adf57c5c 82890f6e nt!IofCallDriver+0x63
        adf57c7c 828a2f32 nt!IopSynchronousServiceTail+0x1f8
        adf57d08 8269644a nt!NtReadFile+0x644
        adf57d08 775264f4 nt!KiFastCallEntry+0x12a (FPO: [0,3] TrapFrame @ adf57d34)
        007cfb84 7752570c ntdll!KiFastSystemCallRet (FPO: [0,0,0])
        007cfb88 757af249 ntdll!ZwReadFile+0xc (FPO: [9,0,0])
        007cfbec 771bdafd KERNELBASE!ReadFile+0xaa (FPO: [Non-Fpo])
        007cfc34 72fd1a25 kernel32!ReadFileImplementation+0xf0 (FPO: [Non-Fpo])
        007cfc58 74272910 winmad_72fd0000!CWMProvider::Receive+0xa5 (FPO: [Non-Fpo]) (CONV: stdcall) [a:\builds\9565\branches\mlnx_winof-3_0_0\core\winmad\user\wm_provider.cpp @ 227]
        007cfc7c 0042d76a libibumad!umad_recv+0x50 (FPO: [Non-Fpo]) (CONV: stdcall) [a:\builds\9565\branches\mlnx_winof-3_0_0\ulp\libibumad\src\umad.cpp @ 706]
        007cfd0c 72f40a04 opensm!umad_receiver+0xba (FPO: [Non-Fpo]) (CONV: stdcall) [a:\builds\9565\branches\mlnx_winof-3_0_0\ulp\opensm\user\libvendor\osm_vendor_ibumad.c @ 266]
        007cfd18 771c1194 complib!cl_thread_callback+0x14 (FPO: [Non-Fpo]) (CONV: stdcall) [a:\builds\9565\branches\mlnx_winof-3_0_0\core\complib\user\cl_thread.c @ 49]
        007cfd24 7753b3f5 kernel32!BaseThreadInitThunk+0xe (FPO: [Non-Fpo])
        007cfd64 7753b3c8 ntdll!__RtlUserThreadStart+0x70 (FPO: [Non-Fpo])
        007cfd7c 00000000 ntdll!_RtlUserThreadStart+0x1b (FPO: [Non-Fpo])
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/ofw/attachments/20120117/742fa1eb/attachment.html>


More information about the ofw mailing list