[ofw] opensm stuck upon kill

Hefty, Sean sean.hefty at intel.com
Thu Feb 2 11:19:08 PST 2012


From the traces, the cleanup of a killed user space process (in this case opensm) is hung in the kernel.  IBAL is waiting forever on a reference count to drop to 0.  From the details that were provided, either a large number of MADs have been leaked or there's a race condition somewhere that prevents AHs from being freed during the course of normal operation.
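
To illustrate the failure mode described above, here is a minimal user-space sketch, not IBAL code. All names (struct pd, struct av, mad_send, mad_complete, pd_try_destroy) are hypothetical, and it assumes that each outstanding MAD holds one reference on its address handle. A real synchronous destroy would block where this model returns a non-zero count.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_AV 8

/* Hypothetical stand-ins; these are not the real IBAL structures. */
struct av {
    int ref_cnt;   /* references held by outstanding MADs */
    int in_use;    /* non-zero while the AV is attached to the PD */
};

struct pd {
    struct av children[MAX_AV];
};

/* Attach a new AV to the PD; returns NULL if the table is full. */
static struct av *av_create(struct pd *pd)
{
    for (size_t i = 0; i < MAX_AV; i++) {
        if (!pd->children[i].in_use) {
            pd->children[i].in_use = 1;
            pd->children[i].ref_cnt = 0;
            return &pd->children[i];
        }
    }
    return NULL;
}

/* Posting a MAD takes a reference on its AV (assumption of this model). */
static void mad_send(struct av *av)
{
    av->ref_cnt++;
}

/* Completing (or freeing) the MAD drops that reference. */
static void mad_complete(struct av *av)
{
    assert(av->ref_cnt > 0);
    av->ref_cnt--;
}

/*
 * Try to tear down the PD. Children with no references are released;
 * the return value is the number of children still referenced.
 * A blocking destroy would wait here until that number reaches 0,
 * which never happens if a MAD was leaked.
 */
static int pd_try_destroy(struct pd *pd)
{
    int blocked = 0;

    for (size_t i = 0; i < MAX_AV; i++) {
        struct av *av = &pd->children[i];

        if (!av->in_use)
            continue;
        if (av->ref_cnt == 0)
            av->in_use = 0;
        else
            blocked++;
    }
    return blocked;
}
```

In this model a single MAD that is sent but never completed keeps pd_try_destroy() from ever returning 0, which is the shape of the hang in the trace.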


> -----Original Message-----
> From: Smith, Stan
> Sent: Thursday, February 02, 2012 10:55 AM
> To: Leonid Keller; Hefty, Sean; Tzachi Dar
> Cc: Uri Habusha; ofw_list; Irena Gannon
> Subject: RE: opensm stuck upon kill
> 
> Leo,
>   What exactly do you mean by 'opensm stuck on kill'? More kill info
> please.
> 
> Was OpenSM running as a service and via service control you said stop?
> OpenSM running as a console application '--console local' and you typed the
> 'exit' command?
> OpenSM running and you just killed the process?
> 
> Killed how?
> 
> Thanks,
> 
> Stan.
> 
> >-----Original Message-----
> >From: Leonid Keller [mailto:leonid at mellanox.com]
> >Sent: Thursday, February 02, 2012 6:42 AM
> >To: Leonid Keller; Hefty, Sean; Tzachi Dar; Smith, Stan
> >Cc: Uri Habusha; ofw_list; Irena Gannon
> >Subject: opensm stuck upon kill
> >
> >Hi guys,
> >
> >opensm got stuck upon kill
> >I'll try to keep the full dump and will send you if you are interested.
> >
> >The hang occurs in IBAL while releasing the PD.
> >
> > nt!DbgBreakPoint
> > ibbus!sync_destroy_obj+0xa61
> > ibbus!destroy_obj+0x8ad
> > ibbus!async_destroy_obj+0xa4
> > ibbus!ib_dealloc_pd+0x2b6
> > winmad!WmRegRemoveHandler+0xae
> >...
> >
> >PD can't be released because its children AVs are not released:
> >
> >// from ibbus!sync_destroy_obj
> >1: kd> ?? p_obj
> >struct _al_obj * 0xa970fbbc
> >   ...
> >   +0x080 ref_cnt          : 1
> >   ...
> >   +0x0a4 type             : 3		//it's AV
> >   +0x0a8 state            : 3 ( CL_DESTROYING )
> >   ...
> >
> >There are 227 children (AVs), which, as far as I understand, are created and attached to the PD upon send_mad.
> >Several applications were running at the time of the hang; opensm was one of them.
> >Opensm was killed and now has only one thread, the one that is stuck:
> >
> >                          [cda39020 opensm.exe]
> > 83c.0003a8  9af686f0 0000002 RUNNING    nt!DbgBreakPoint
> >                                        ibbus!sync_destroy_obj+0xa61
> >                                        ibbus!destroy_obj+0x8ad
> >                                        ibbus!async_destroy_obj+0xa4
> >                                        ibbus!ib_dealloc_pd+0x2b6
> >                                        winmad!WmRegRemoveHandler+0xae
> >                                        winmad!WmRegFree+0xe
> >                                        winmad!WmProviderCleanup+0x24
> >                                        winmad!WmFileCleanup+0x3a
> >                                        Wdf01000!FxFileObjectFileCleanup::Invoke+0x24
> >                                        Wdf01000!FxPkgGeneral::OnCleanup+0x57
> >                                        Wdf01000!FxPkgGeneral::Dispatch+0xcb
> >                                        Wdf01000!FxDevice::Dispatch+0x7f
> >                                        nt!IovCallDriver+0x23f
> >                                        nt!IofCallDriver+0x1b
> >                                        nt!IopCloseFile+0x387
> >                                        nt!ObpDecrementHandleCount+0x146
> >                                        nt!ObpCloseHandleTableEntry+0x234
> >                                        nt!ExSweepHandleTable+0x5f
> >                                        nt!ObKillProcess+0x54
> >                                        nt!PspExitThread+0x5b6
> >                                        nt!PsExitSpecialApc+0x22
> >                                        nt!KiDeliverApc+0x1dc
> >                                        nt!KiServiceExit+0x56
> >                                        ntdll!KiFastSystemCallRet
> >                                        ntdll!ZwWaitForWorkViaWorkerFactory+0xc
> >                                        ntdll!TppWorkerThread+0x1f6
> >                                        kernel32!BaseThreadInitThunk+0xe
> >                                        ntdll!__RtlUserThreadStart+0x23
> >                                        ntdll!_RtlUserThreadStart+
> >
> >winmad!WmRegRemoveHandler+0xae is standing here:
> >
> >	WmProviderDeregister(pRegistration->pProvider, pRegistration);
> >	pRegistration->pDevice->IbInterface.destroy_qp(pRegistration->hQp, NULL);
> >	pRegistration->pDevice->IbInterface.dealloc_pd(pRegistration->hPd, NULL);
> >>	pRegistration->pDevice->IbInterface.close_ca(pRegistration->hCa, NULL);
> >
> >Could you suggest some idea ?
> >Thank you.
> >
> >
> >-----Original Message-----
> >From: Leonid Keller
> >Sent: Tuesday, January 31, 2012 1:15 PM
> >To: 'Hefty, Sean'; Tzachi Dar; Smith, Stan
> >Cc: Uri Habusha; ofw_list; Irena Gannon
> >Subject: RE: Opensm & WinMad: a race, causing BSOD722
> >
> >Thank you, Sean.
> >
> >Some comments.
> >We do not think that this additional validation is necessary.
> >It's hard to believe, unless you have seen it happen, that Windows can call close(handle) after open(&handle) has failed.
> >
> >As to the patch to winverbs - it causes a crash, because WvProviderGet is called at DISPATCH level.
> >
> >ATTEMPTED_SWITCH_FROM_DPC (b8)
> >A wait operation, attach process, or yield was attempted from a DPC routine.
> >This is an illegal operation and the stack track will lead to the offending
> >code and original DPC routine.
> >
> >nt!KiSwapContext+0x7f
> >nt!KiSwapThread+0x2fa
> >nt!KeWaitForGate+0x22a
> >nt!KiAcquireGuardedMutex+0x35
> >nt!KeAcquireGuardedMutex+0x39
> >winverbs!WvProviderGet+0x1d
> >winverbs!WvEpCompleteDisconnect+0x113
> >winverbs!WvEpIbCmHandler+0x26a
> >ibbus!cm_cep_handler+0x99
> >ibbus!__process_cep+0x10f
> >ibbus!__drep_handler+0x6ea
> >ibbus!__cep_mad_recv_cb+0x246
> >ibbus!__mad_svc_recv_done+0xb58
> >ibbus!mad_disp_recv_done+0x1650
> >ibbus!process_mad_recv+0x3bf
> >ibbus!spl_qp_comp+0x3d2
> >ibbus!spl_qp_recv_dpc_cb+0x112
> >nt!KiRetireDpcList+0x117
> >nt!KyRetireDpcList+0x5
> >nt!KiDispatchInterruptContinue
> >
> >I've replaced the mutex with a spinlock - see below.
> >I did the same for WinMad, although it has no asynchronous callbacks like WinVerbs.
> >The main reason is to keep it similar to WinVerbs as it is today.
> >A minor, mostly theoretical reason: other functions use the provider mutex today. It seems worthwhile to me to keep the possibility for them to call the low-level WvProviderGet function.
> >What's your opinion?
> >
> >Index: B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.c
> >===================================================================
> >--- B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.c	(revision 9686)
> >+++ B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.c	(revision 9687)
> >@@ -44,14 +44,15 @@
> > LONG WvProviderGet(WV_PROVIDER *pProvider)
> > {
> > 	LONG val;
> >+	KIRQL irql;
> >
> >-	KeAcquireGuardedMutex(&pProvider->Lock);
> >+	KeAcquireSpinLock(&pProvider->SpinLock, &irql);
> > 	val = InterlockedIncrement(&pProvider->Ref);
> > 	if (val == 1) {
> > 		pProvider->Ref = 0;
> > 		val = 0;
> > 	}
> >-	KeReleaseGuardedMutex(&pProvider->Lock);
> >+	KeReleaseSpinLock(&pProvider->SpinLock, irql);
> > 	return val;
> > }
> >
> >@@ -119,6 +120,7 @@
> > 	KeInitializeEvent(&pProvider->SharedEvent, NotificationEvent, FALSE);
> > 	pProvider->Exclusive = 0;
> > 	KeInitializeEvent(&pProvider->ExclusiveEvent, SynchronizationEvent, FALSE);
> >+	KeInitializeSpinLock(&pProvider->SpinLock);
> > 	return STATUS_SUCCESS;
> > }
> >
> >Index: B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.h
> >===================================================================
> >--- B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.h	(revision 9686)
> >+++ B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.h	(revision 9687)
> >@@ -80,6 +80,7 @@
> > 	KEVENT			ExclusiveEvent;
> >
> > 	WORK_QUEUE		WorkQueue;
> >+	KSPIN_LOCK		SpinLock;
> >
> > }	WV_PROVIDER;
> >
> >Index: B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.h
> >===================================================================
> >--- B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.h	(revision 9687)
> >+++ B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.h	(revision 9688)
> >@@ -57,6 +57,7 @@
> > 	KEVENT				SharedEvent;
> > 	LONG				Exclusive;
> > 	KEVENT				ExclusiveEvent;
> >+	KSPIN_LOCK			SpinLock;
> >
> > }	WM_PROVIDER;
> >
> >Index: B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.c
> >===================================================================
> >--- B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.c	(revision 9687)
> >+++ B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.c	(revision 9688)
> >@@ -36,14 +36,15 @@
> > LONG WmProviderGet(WM_PROVIDER *pProvider)
> > {
> > 	LONG val;
> >+	KIRQL irql;
> >
> >-	KeAcquireGuardedMutex(&pProvider->Lock);
> >+	KeAcquireSpinLock(&pProvider->SpinLock, &irql);
> > 	val = InterlockedIncrement(&pProvider->Ref);
> > 	if (val == 1) {
> > 		pProvider->Ref = 0;
> > 		val = 0;
> > 	}
> >-	KeReleaseGuardedMutex(&pProvider->Lock);
> >+	KeReleaseSpinLock(&pProvider->SpinLock, irql);
> > 	return val;
> > }
> >
> >@@ -72,6 +73,7 @@
> > 	KeInitializeEvent(&pProvider->SharedEvent, NotificationEvent, FALSE);
> > 	pProvider->Exclusive = 0;
> > 	KeInitializeEvent(&pProvider->ExclusiveEvent, SynchronizationEvent, FALSE);
> >+	KeInitializeSpinLock(&pProvider->SpinLock);
> >
> > 	ASSERT(ControlDevice != NULL);
> >
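
The lock-protected "get" in the patch above can be modelled in user space. This is only a sketch under stated assumptions: struct provider, provider_init, lock_acquire, lock_release and provider_get are hypothetical names, a C11 atomic_flag stands in for KSPIN_LOCK, and a plain increment under the lock stands in for InterlockedIncrement. The point it shows is that the lock never sleeps (a spin lock may be taken at DISPATCH_LEVEL, whereas a guarded mutex requires IRQL <= APC_LEVEL) and that a reference count of 0 is never resurrected.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical user-space model; not the WV_PROVIDER / WM_PROVIDER layout. */
struct provider {
    atomic_flag lock;  /* stand-in for KSPIN_LOCK */
    long ref;          /* 0 means the provider is gone or being torn down */
};

static void provider_init(struct provider *p, long initial_ref)
{
    atomic_flag_clear(&p->lock);
    p->ref = initial_ref;
}

/* Busy-wait until the flag is ours; this never blocks or yields. */
static void lock_acquire(atomic_flag *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire)) {
        /* spin */
    }
}

static void lock_release(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

/*
 * Take a reference. Returns the new count, or 0 if the count was
 * already 0, in which case the count is left at 0 so that a dying
 * provider is not brought back to life.
 */
static long provider_get(struct provider *p)
{
    long val;

    assert(p != NULL);
    lock_acquire(&p->lock);
    val = ++p->ref;
    if (val == 1) {
        /* The count was 0 before the increment: undo and report failure. */
        p->ref = 0;
        val = 0;
    }
    lock_release(&p->lock);
    return val;
}
```

A caller treats a return value of 0 as "provider unavailable" and must not touch the provider further; any other value means a reference was taken and has to be released later.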
> >
> >-----Original Message-----
> >From: Hefty, Sean [mailto:sean.hefty at intel.com]
> >Sent: Tuesday, January 31, 2012 12:08 AM
> >To: Leonid Keller; Tzachi Dar; Smith, Stan
> >Cc: Uri Habusha; ofw_list; Irena Gannon
> >Subject: RE: Opensm & WinMad: a race, causing BSOD722
> >
> >> Two ideas:
> >> WmProviderInit() is called without checking the return status. Is there a
> >> reason ?
> >> It seems a similar patch is needed for WvIoDeviceControl().
> >
> >I can't tell whether IOCTLs suffer from the same problem or not.  But since Windows is stupid, I went ahead and added the same protection to winverbs, plus some additional validation in case we get a cleanup event for a file that we failed to create.
> >
> >
> >
> >
> >- Sean


