[ofw] opensm stuck upon kill
Hefty, Sean
sean.hefty at intel.com
Thu Feb 2 11:19:08 PST 2012
From the traces, the cleanup of a killed user-space process (in this case opensm) is hung in the kernel. IBAL is waiting forever for a reference count to drop to 0. From the details provided, either a large number of MADs have been leaked, or there's a race condition somewhere that prevents AHs from being freed during normal operation.
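
To illustrate the failure mode, here is a minimal user-space sketch in C of parent/child reference counting. It is not IBAL code: obj_t, obj_init, obj_deref and obj_try_destroy are hypothetical names, and the real code blocks in sync_destroy_obj instead of returning a flag. The sketch assumes that each child (an AV/AH) pins its parent (the PD) and that an outstanding MAD holds an extra reference on its AV.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an IBAL object (not the real al_obj). */
typedef struct obj {
    struct obj *parent;  /* e.g. the PD that owns an AV; NULL for a root object */
    int ref_cnt;         /* references currently held on this object */
} obj_t;

/* Create an object with one creation reference; a child pins its parent. */
static void obj_init(obj_t *o, obj_t *parent)
{
    o->parent = parent;
    o->ref_cnt = 1;
    if (parent != NULL)
        parent->ref_cnt++;
}

/* Drop one reference; when the last one goes, unpin the parent as well. */
static void obj_deref(obj_t *o)
{
    assert(o->ref_cnt > 0);
    if (--o->ref_cnt == 0 && o->parent != NULL)
        obj_deref(o->parent);
}

/*
 * Start destruction by dropping the creation reference (call once per object).
 * Returns true if the count reached 0, i.e. destruction can complete.
 * A false return models the case where the real code would keep waiting.
 */
static bool obj_try_destroy(obj_t *o)
{
    obj_deref(o);
    return o->ref_cnt == 0;
}
```

In this model, if a single AV still carries one extra reference (a leaked MAD, say), obj_try_destroy() returns false for that AV and then for the PD, which corresponds to the wait that never finishes in the trace. Only when that last reference is dropped with obj_deref() does the PD's count reach 0.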
> -----Original Message-----
> From: Smith, Stan
> Sent: Thursday, February 02, 2012 10:55 AM
> To: Leonid Keller; Hefty, Sean; Tzachi Dar
> Cc: Uri Habusha; ofw_list; Irena Gannon
> Subject: RE: opensm stuck upon kill
>
> Leo,
> What are you saying exactly by 'opensm stuck on kill'? More kill info
> please.
>
> Was OpenSM running as a service and via service control you said stop?
> OpenSM running as a console application '--console local' and you typed the
> 'exit' command?
> OpenSM running and you just killed the process?
>
> Killed how?
>
> Thanks,
>
> Stan.
>
> >-----Original Message-----
> >From: Leonid Keller [mailto:leonid at mellanox.com]
> >Sent: Thursday, February 02, 2012 6:42 AM
> >To: Leonid Keller; Hefty, Sean; Tzachi Dar; Smith, Stan
> >Cc: Uri Habusha; ofw_list; Irena Gannon
> >Subject: opensm stuck upon kill
> >
> >Hi guys,
> >
> >opensm hung when it was killed.
> >I'll try to keep the full dump and will send it to you if you are interested.
> >
> >The hang occurs in IBAL when releasing the PD.
> >
> > nt!DbgBreakPoint
> > ibbus!sync_destroy_obj+0xa61
> > ibbus!destroy_obj+0x8ad
> > ibbus!async_destroy_obj+0xa4
> > ibbus!ib_dealloc_pd+0x2b6
> > winmad!WmRegRemoveHandler+0xae
> >...
> >
> >The PD can't be released because its child AVs have not been released:
> >
> >// from ibbus!sync_destroy_obj
> >1: kd> ?? p_obj
> >struct _al_obj * 0xa970fbbc
> > ...
> > +0x080 ref_cnt : 1
> > ...
> > +0x0a4 type : 3 //it's AV
> > +0x0a8 state : 3 ( CL_DESTROYING )
> > ...
> >
> >There are 227 children (AVs), which, as far as I understand, are created and attached to the PD upon send_mad.
> >Several applications were running at the time of the hang; opensm was one of them.
> >Opensm was killed and now has only one thread, the one that is stuck:
> >
> > [cda39020 opensm.exe]
> > 83c.0003a8 9af686f0 0000002 RUNNING nt!DbgBreakPoint
> > ibbus!sync_destroy_obj+0xa61
> > ibbus!destroy_obj+0x8ad
> > ibbus!async_destroy_obj+0xa4
> > ibbus!ib_dealloc_pd+0x2b6
> > winmad!WmRegRemoveHandler+0xae
> > winmad!WmRegFree+0xe
> > winmad!WmProviderCleanup+0x24
> > winmad!WmFileCleanup+0x3a
> > Wdf01000!FxFileObjectFileCleanup::Invoke+0x24
> > Wdf01000!FxPkgGeneral::OnCleanup+0x57
> > Wdf01000!FxPkgGeneral::Dispatch+0xcb
> > Wdf01000!FxDevice::Dispatch+0x7f
> > nt!IovCallDriver+0x23f
> > nt!IofCallDriver+0x1b
> > nt!IopCloseFile+0x387
> > nt!ObpDecrementHandleCount+0x146
> > nt!ObpCloseHandleTableEntry+0x234
> > nt!ExSweepHandleTable+0x5f
> > nt!ObKillProcess+0x54
> > nt!PspExitThread+0x5b6
> > nt!PsExitSpecialApc+0x22
> > nt!KiDeliverApc+0x1dc
> > nt!KiServiceExit+0x56
> > ntdll!KiFastSystemCallRet
> > ntdll!ZwWaitForWorkViaWorkerFactory+0xc
> > ntdll!TppWorkerThread+0x1f6
> > kernel32!BaseThreadInitThunk+0xe
> > ntdll!__RtlUserThreadStart+0x23
> > ntdll!_RtlUserThreadStart+
> >
> >winmad!WmRegRemoveHandler+0xae points here:
> >
> > WmProviderDeregister(pRegistration->pProvider, pRegistration);
> > pRegistration->pDevice->IbInterface.destroy_qp(pRegistration->hQp, NULL);
> > pRegistration->pDevice->IbInterface.dealloc_pd(pRegistration->hPd, NULL);
> >> pRegistration->pDevice->IbInterface.close_ca(pRegistration->hCa, NULL);
> >
> >Do you have any ideas?
> >Thank you.
> >
> >
> >-----Original Message-----
> >From: Leonid Keller
> >Sent: Tuesday, January 31, 2012 1:15 PM
> >To: 'Hefty, Sean'; Tzachi Dar; Smith, Stan
> >Cc: Uri Habusha; ofw_list; Irena Gannon
> >Subject: RE: Opensm & WinMad: a race, cauing BSOD722
> >
> >Thank you, Sean.
> >
> >Some comments.
> >We do not think that this additional validation is necessary.
> >It's hard to believe, unless you have seen it, that Windows can call close(handle) after open(&handle) has failed.
> >
> >As to the patch to winverbs: it causes a crash, because WvProviderGet is called at DISPATCH_LEVEL.
> >
> >ATTEMPTED_SWITCH_FROM_DPC (b8)
> >A wait operation, attach process, or yield was attempted from a DPC routine.
> >This is an illegal operation and the stack track will lead to the offending
> >code and original DPC routine.
> >
> >nt!KiSwapContext+0x7f
> >nt!KiSwapThread+0x2fa
> >nt!KeWaitForGate+0x22a
> >nt!KiAcquireGuardedMutex+0x35
> >nt!KeAcquireGuardedMutex+0x39
> >winverbs!WvProviderGet+0x1d
> >winverbs!WvEpCompleteDisconnect+0x113
> >winverbs!WvEpIbCmHandler+0x26a
> >ibbus!cm_cep_handler+0x99
> >ibbus!__process_cep+0x10f
> >ibbus!__drep_handler+0x6ea
> >ibbus!__cep_mad_recv_cb+0x246
> >ibbus!__mad_svc_recv_done+0xb58
> >ibbus!mad_disp_recv_done+0x1650
> >ibbus!process_mad_recv+0x3bf
> >ibbus!spl_qp_comp+0x3d2
> >ibbus!spl_qp_recv_dpc_cb+0x112
> >nt!KiRetireDpcList+0x117
> >nt!KyRetireDpcList+0x5
> >nt!KiDispatchInterruptContinue
> >
> >I've replaced the mutex with a spinlock - see below.
> >I did the same for WinMad, although it has no asynchronous callbacks like WinVerbs.
> >The main reason is to keep it similar to WinVerbs as it is today.
> >A minor, mostly theoretical one: other functions use the provider mutex today, and it seems worthwhile to keep it possible for them to call the low-level WvProviderGet function.
> >What's your opinion?
> >
> >Index: B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.c
> >===================================================================
> >--- B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.c (revision 9686)
> >+++ B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.c (revision 9687)
> >@@ -44,14 +44,15 @@
> > LONG WvProviderGet(WV_PROVIDER *pProvider)
> > {
> > LONG val;
> >+ KIRQL irql;
> >
> >- KeAcquireGuardedMutex(&pProvider->Lock);
> >+ KeAcquireSpinLock(&pProvider->SpinLock, &irql);
> > val = InterlockedIncrement(&pProvider->Ref);
> > if (val == 1) {
> > pProvider->Ref = 0;
> > val = 0;
> > }
> >- KeReleaseGuardedMutex(&pProvider->Lock);
> >+ KeReleaseSpinLock(&pProvider->SpinLock, irql);
> > return val;
> > }
> >
> >@@ -119,6 +120,7 @@
> > KeInitializeEvent(&pProvider->SharedEvent, NotificationEvent, FALSE);
> > pProvider->Exclusive = 0;
> > KeInitializeEvent(&pProvider->ExclusiveEvent, SynchronizationEvent, FALSE);
> >+ KeInitializeSpinLock(&pProvider->SpinLock);
> > return STATUS_SUCCESS;
> > }
> >
> >Index: B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.h
> >===================================================================
> >--- B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.h (revision 9686)
> >+++ B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.h (revision 9687)
> >@@ -80,6 +80,7 @@
> > KEVENT ExclusiveEvent;
> >
> > WORK_QUEUE WorkQueue;
> >+ KSPIN_LOCK SpinLock;
> >
> > } WV_PROVIDER;
> >
> >Index: B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.h
> >===================================================================
> >--- B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.h (revision 9687)
> >+++ B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.h (revision 9688)
> >@@ -57,6 +57,7 @@
> > KEVENT SharedEvent;
> > LONG Exclusive;
> > KEVENT ExclusiveEvent;
> >+ KSPIN_LOCK SpinLock;
> >
> > } WM_PROVIDER;
> >
> >Index: B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.c
> >===================================================================
> >--- B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.c (revision 9687)
> >+++ B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.c (revision 9688)
> >@@ -36,14 +36,15 @@
> > LONG WmProviderGet(WM_PROVIDER *pProvider)
> > {
> > LONG val;
> >+ KIRQL irql;
> >
> >- KeAcquireGuardedMutex(&pProvider->Lock);
> >+ KeAcquireSpinLock(&pProvider->SpinLock, &irql);
> > val = InterlockedIncrement(&pProvider->Ref);
> > if (val == 1) {
> > pProvider->Ref = 0;
> > val = 0;
> > }
> >- KeReleaseGuardedMutex(&pProvider->Lock);
> >+ KeReleaseSpinLock(&pProvider->SpinLock, irql);
> > return val;
> > }
> >
> >@@ -72,6 +73,7 @@
> > KeInitializeEvent(&pProvider->SharedEvent, NotificationEvent, FALSE);
> > pProvider->Exclusive = 0;
> > KeInitializeEvent(&pProvider->ExclusiveEvent, SynchronizationEvent, FALSE);
> >+ KeInitializeSpinLock(&pProvider->SpinLock);
> >
> > ASSERT(ControlDevice != NULL);
> >
> >
> >-----Original Message-----
> >From: Hefty, Sean [mailto:sean.hefty at intel.com]
> >Sent: Tuesday, January 31, 2012 12:08 AM
> >To: Leonid Keller; Tzachi Dar; Smith, Stan
> >Cc: Uri Habusha; ofw_list; Irena Gannon
> >Subject: RE: Opensm & WinMad: a race, cauing BSOD722
> >
> >> Two ideas:
> >> WmProviderInit() is called without checking the return status. Is there a
> >> reason ?
> >> Seems like the similar patch is needed for WvIoDeviceControl().
> >
> >I can't tell whether IOCTLs suffer from the same problem or not. But since Windows is stupid, I went ahead and added the same protection to winverbs, plus some additional validation in case we get a cleanup event for a file that we failed to create.
> >
> >
> >
> >
> >- Sean