[ofw] opensm stuck upon kill
Smith, Stan
stan.smith at intel.com
Thu Feb 2 10:55:11 PST 2012
Leo,
What exactly do you mean by 'opensm stuck on kill'? Please give more detail on how it was killed.
Was OpenSM running as a service, and did you stop it via service control?
Was OpenSM running as a console application ('--console local'), and did you type the 'exit' command?
Or was OpenSM running and you just killed the process?
If so, killed how?
Thanks,
Stan.
>-----Original Message-----
>From: Leonid Keller [mailto:leonid at mellanox.com]
>Sent: Thursday, February 02, 2012 6:42 AM
>To: Leonid Keller; Hefty, Sean; Tzachi Dar; Smith, Stan
>Cc: Uri Habusha; ofw_list; Irena Gannon
>Subject: opensm stuck upon kill
>
>Hi guys,
>
>opensm hung when it was killed.
>I'll try to keep the full dump and will send it to you if you are interested.
>
>The hang occurs in IBAL while releasing the PD.
>
> nt!DbgBreakPoint
> ibbus!sync_destroy_obj+0xa61
> ibbus!destroy_obj+0x8ad
> ibbus!async_destroy_obj+0xa4
> ibbus!ib_dealloc_pd+0x2b6
> winmad!WmRegRemoveHandler+0xae
>...
>
>The PD can't be released because its child AVs have not been released:
>
>// from ibbus!sync_destroy_obj
>1: kd> ?? p_obj
>struct _al_obj * 0xa970fbbc
> ...
> +0x080 ref_cnt : 1
> ...
> +0x0a4 type : 3 //it's AV
> +0x0a8 state : 3 ( CL_DESTROYING )
> ...
>
>There are 227 children (AVs), which, as far as I understand, are created and attached to the PD on send_mad.
>Several applications were running at the time of the hang; opensm was one of them.
>Opensm was killed and now has only one thread, the one that is stuck:
>
> [cda39020 opensm.exe]
> 83c.0003a8 9af686f0 0000002 RUNNING nt!DbgBreakPoint
> ibbus!sync_destroy_obj+0xa61
> ibbus!destroy_obj+0x8ad
> ibbus!async_destroy_obj+0xa4
> ibbus!ib_dealloc_pd+0x2b6
> winmad!WmRegRemoveHandler+0xae
> winmad!WmRegFree+0xe
> winmad!WmProviderCleanup+0x24
> winmad!WmFileCleanup+0x3a
> Wdf01000!FxFileObjectFileCleanup::Invoke+0x24
> Wdf01000!FxPkgGeneral::OnCleanup+0x57
> Wdf01000!FxPkgGeneral::Dispatch+0xcb
> Wdf01000!FxDevice::Dispatch+0x7f
> nt!IovCallDriver+0x23f
> nt!IofCallDriver+0x1b
> nt!IopCloseFile+0x387
> nt!ObpDecrementHandleCount+0x146
> nt!ObpCloseHandleTableEntry+0x234
> nt!ExSweepHandleTable+0x5f
> nt!ObKillProcess+0x54
> nt!PspExitThread+0x5b6
> nt!PsExitSpecialApc+0x22
> nt!KiDeliverApc+0x1dc
> nt!KiServiceExit+0x56
> ntdll!KiFastSystemCallRet
> ntdll!ZwWaitForWorkViaWorkerFactory+0xc
> ntdll!TppWorkerThread+0x1f6
> kernel32!BaseThreadInitThunk+0xe
> ntdll!__RtlUserThreadStart+0x23
> ntdll!_RtlUserThreadStart+
>
>winmad!WmRegRemoveHandler+0xae corresponds to this location:
>
> WmProviderDeregister(pRegistration->pProvider, pRegistration);
> pRegistration->pDevice->IbInterface.destroy_qp(pRegistration->hQp, NULL);
> pRegistration->pDevice->IbInterface.dealloc_pd(pRegistration->hPd, NULL);
>> pRegistration->pDevice->IbInterface.close_ca(pRegistration->hCa, NULL);
>
>Do you have any ideas?
>Thank you.
>
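
The situation described above, a parent object that cannot finish destruction while its children still hold references on it, can be modeled in user space. This is only a rough sketch: the `model_*` names, the struct layout, and the "one creation reference plus one reference per child" rule are assumptions made for illustration and are not taken from the IBAL sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of a reference-counted object tree.
 * None of these names come from IBAL. */
typedef struct model_obj {
    struct model_obj *parent;   /* NULL for a top-level object */
    int  ref_cnt;               /* creation reference + one per live child */
    bool destroying;            /* destruction has been requested */
    bool destroyed;             /* destruction has completed */
} model_obj_t;

/* Initialize an object: it holds one creation reference on itself
 * and takes one reference on its parent, if any. */
static void model_init(model_obj_t *obj, model_obj_t *parent)
{
    obj->parent = parent;
    obj->ref_cnt = 1;
    obj->destroying = false;
    obj->destroyed = false;
    if (parent != NULL)
        parent->ref_cnt++;
}

/* Try to destroy an object. Returns true once destruction completes,
 * false while references are still outstanding (the point where a
 * synchronous destroy would have to keep waiting). */
static bool model_try_destroy(model_obj_t *obj)
{
    if (obj->destroyed)
        return true;
    if (!obj->destroying) {
        obj->destroying = true;
        assert(obj->ref_cnt > 0);
        obj->ref_cnt--;             /* drop the creation reference */
    }
    if (obj->ref_cnt > 0)
        return false;               /* children still hold references */
    if (obj->parent != NULL)
        obj->parent->ref_cnt--;     /* release our reference on the parent */
    obj->destroyed = true;
    return true;
}
```

In this model, calling `model_try_destroy()` on a parent that still has live children keeps returning false until every child has itself been destroyed, which is the shape of the hang reported here.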
>
>-----Original Message-----
>From: Leonid Keller
>Sent: Tuesday, January 31, 2012 1:15 PM
>To: 'Hefty, Sean'; Tzachi Dar; Smith, Stan
>Cc: Uri Habusha; ofw_list; Irena Gannon
>Subject: RE: Opensm & WinMad: a race, cauing BSOD722
>
>Thank you, Sean.
>
>Some comments.
>We do not think that this additional validation is necessary.
>It's hard to believe, unless you have actually seen it, that Windows can call close(handle) after open(&handle) has failed.
>
>As for the winverbs patch: it causes a crash, because WvProviderGet is called at DISPATCH_LEVEL.
>
>ATTEMPTED_SWITCH_FROM_DPC (b8)
>A wait operation, attach process, or yield was attempted from a DPC routine.
>This is an illegal operation and the stack track will lead to the offending
>code and original DPC routine.
>
>nt!KiSwapContext+0x7f
>nt!KiSwapThread+0x2fa
>nt!KeWaitForGate+0x22a
>nt!KiAcquireGuardedMutex+0x35
>nt!KeAcquireGuardedMutex+0x39
>winverbs!WvProviderGet+0x1d
>winverbs!WvEpCompleteDisconnect+0x113
>winverbs!WvEpIbCmHandler+0x26a
>ibbus!cm_cep_handler+0x99
>ibbus!__process_cep+0x10f
>ibbus!__drep_handler+0x6ea
>ibbus!__cep_mad_recv_cb+0x246
>ibbus!__mad_svc_recv_done+0xb58
>ibbus!mad_disp_recv_done+0x1650
>ibbus!process_mad_recv+0x3bf
>ibbus!spl_qp_comp+0x3d2
>ibbus!spl_qp_recv_dpc_cb+0x112
>nt!KiRetireDpcList+0x117
>nt!KyRetireDpcList+0x5
>nt!KiDispatchInterruptContinue
>
>I've replaced the mutex with a spinlock - see below.
>I did the same for WinMad, although it has no asynchronous callbacks like WinVerbs.
>The main reason is to keep it similar to WinVerbs as it is today.
>A minor, mostly theoretical reason: other functions use the provider mutex today, and it seems worthwhile to keep it
>possible for them to call the low-level WvProviderGet function.
>What's your opinion?
>
>Index: B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.c
>===================================================================
>--- B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.c (revision 9686)
>+++ B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.c (revision 9687)
>@@ -44,14 +44,15 @@
> LONG WvProviderGet(WV_PROVIDER *pProvider)
> {
> LONG val;
>+ KIRQL irql;
>
>- KeAcquireGuardedMutex(&pProvider->Lock);
>+ KeAcquireSpinLock(&pProvider->SpinLock, &irql);
> val = InterlockedIncrement(&pProvider->Ref);
> if (val == 1) {
> pProvider->Ref = 0;
> val = 0;
> }
>- KeReleaseGuardedMutex(&pProvider->Lock);
>+ KeReleaseSpinLock(&pProvider->SpinLock, irql);
> return val;
> }
>
>@@ -119,6 +120,7 @@
> KeInitializeEvent(&pProvider->SharedEvent, NotificationEvent, FALSE);
> pProvider->Exclusive = 0;
> KeInitializeEvent(&pProvider->ExclusiveEvent, SynchronizationEvent, FALSE);
>+ KeInitializeSpinLock(&pProvider->SpinLock);
> return STATUS_SUCCESS;
> }
>
>Index: B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.h
>===================================================================
>--- B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.h (revision 9686)
>+++ B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.h (revision 9687)
>@@ -80,6 +80,7 @@
> KEVENT ExclusiveEvent;
>
> WORK_QUEUE WorkQueue;
>+ KSPIN_LOCK SpinLock;
>
> } WV_PROVIDER;
>
>Index: B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.h
>===================================================================
>--- B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.h (revision 9687)
>+++ B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.h (revision 9688)
>@@ -57,6 +57,7 @@
> KEVENT SharedEvent;
> LONG Exclusive;
> KEVENT ExclusiveEvent;
>+ KSPIN_LOCK SpinLock;
>
> } WM_PROVIDER;
>
>Index: B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.c
>===================================================================
>--- B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.c (revision 9687)
>+++ B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.c (revision 9688)
>@@ -36,14 +36,15 @@
> LONG WmProviderGet(WM_PROVIDER *pProvider)
> {
> LONG val;
>+ KIRQL irql;
>
>- KeAcquireGuardedMutex(&pProvider->Lock);
>+ KeAcquireSpinLock(&pProvider->SpinLock, &irql);
> val = InterlockedIncrement(&pProvider->Ref);
> if (val == 1) {
> pProvider->Ref = 0;
> val = 0;
> }
>- KeReleaseGuardedMutex(&pProvider->Lock);
>+ KeReleaseSpinLock(&pProvider->SpinLock, irql);
> return val;
> }
>
>@@ -72,6 +73,7 @@
> KeInitializeEvent(&pProvider->SharedEvent, NotificationEvent, FALSE);
> pProvider->Exclusive = 0;
> KeInitializeEvent(&pProvider->ExclusiveEvent, SynchronizationEvent, FALSE);
>+ KeInitializeSpinLock(&pProvider->SpinLock);
>
> ASSERT(ControlDevice != NULL);
>
>
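
The contract that WvProviderGet and WmProviderGet implement in the diff above is "take a reference unless the count is already zero, in which case return 0 and leave it at zero". As a rough user-space illustration of that contract only, here is a sketch using C11 atomics and a compare-exchange loop. This is a different technique from the patch, which uses a spin lock around InterlockedIncrement, and the `model_ref_*` names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Take a reference unless the count is already zero.
 * Returns the new count, or 0 if no reference was taken. */
static long model_ref_get(atomic_long *ref)
{
    long old = atomic_load(ref);
    while (old != 0) {
        /* On failure, 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(ref, &old, old + 1))
            return old + 1;
    }
    return 0;   /* count was zero: object is going away */
}

/* Drop a reference and return the remaining count. */
static long model_ref_put(atomic_long *ref)
{
    long prev = atomic_fetch_sub(ref, 1);
    assert(prev > 0);
    return prev - 1;
}
```

Starting from a count of 1, `model_ref_get()` returns 2; once the count has been dropped to 0, later calls return 0 and never revive the object.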
>-----Original Message-----
>From: Hefty, Sean [mailto:sean.hefty at intel.com]
>Sent: Tuesday, January 31, 2012 12:08 AM
>To: Leonid Keller; Tzachi Dar; Smith, Stan
>Cc: Uri Habusha; ofw_list; Irena Gannon
>Subject: RE: Opensm & WinMad: a race, cauing BSOD722
>
>> Two ideas:
>> WmProviderInit() is called without checking the return status. Is there a
>> reason ?
>> Seems like the similar patch is needed for WvIoDeviceControl().
>
>I can't tell whether IOCTLs suffer from the same problem or not. But since Windows is stupid, I went ahead and added the same protection
>to winverbs, plus some additional validation in case we get a cleanup event for a file that we failed to create.
>
>
>
>
>- Sean