[ofw] opensm stuck upon kill

Smith, Stan stan.smith at intel.com
Thu Feb 2 10:55:11 PST 2012


Leo,
  What exactly do you mean by 'opensm stuck on kill'? More details about the kill, please.

Was OpenSM running as a service, and you stopped it via service control?
Was OpenSM running as a console application ('--console local'), and you typed the 'exit' command?
Or was OpenSM running and you just killed the process?

Killed how?

Thanks,

Stan.

>-----Original Message-----
>From: Leonid Keller [mailto:leonid at mellanox.com]
>Sent: Thursday, February 02, 2012 6:42 AM
>To: Leonid Keller; Hefty, Sean; Tzachi Dar; Smith, Stan
>Cc: Uri Habusha; ofw_list; Irena Gannon
>Subject: opensm stuck upon kill
>
>Hi guys,
>
>opensm hung when it was killed.
>I'll try to keep the full dump and will send it to you if you are interested.
>
>The hang happens in IBAL while releasing the PD.
>
> nt!DbgBreakPoint
> ibbus!sync_destroy_obj+0xa61
> ibbus!destroy_obj+0x8ad
> ibbus!async_destroy_obj+0xa4
> ibbus!ib_dealloc_pd+0x2b6
> winmad!WmRegRemoveHandler+0xae
>...
>
>The PD can't be released because its child AVs have not been released:
>
>// from ibbus!sync_destroy_obj
>1: kd> ?? p_obj
>struct _al_obj * 0xa970fbbc
>   ...
>   +0x080 ref_cnt          : 1
>   ...
>   +0x0a4 type             : 3		//it's AV
>   +0x0a8 state            : 3 ( CL_DESTROYING )
>   ...
>
>There are 227 children (AVs) which, as far as I understand, are created and attached to the PD on send_mad.
>Several applications were running at the time of the hang; opensm was one of them.
>Opensm was killed and now has only one thread, the one that is stuck:
>
>                          [cda39020 opensm.exe]
> 83c.0003a8  9af686f0 0000002 RUNNING    nt!DbgBreakPoint
>                                        ibbus!sync_destroy_obj+0xa61
>                                        ibbus!destroy_obj+0x8ad
>                                        ibbus!async_destroy_obj+0xa4
>                                        ibbus!ib_dealloc_pd+0x2b6
>                                        winmad!WmRegRemoveHandler+0xae
>                                        winmad!WmRegFree+0xe
>                                        winmad!WmProviderCleanup+0x24
>                                        winmad!WmFileCleanup+0x3a
>                                        Wdf01000!FxFileObjectFileCleanup::Invoke+0x24
>                                        Wdf01000!FxPkgGeneral::OnCleanup+0x57
>                                        Wdf01000!FxPkgGeneral::Dispatch+0xcb
>                                        Wdf01000!FxDevice::Dispatch+0x7f
>                                        nt!IovCallDriver+0x23f
>                                        nt!IofCallDriver+0x1b
>                                        nt!IopCloseFile+0x387
>                                        nt!ObpDecrementHandleCount+0x146
>                                        nt!ObpCloseHandleTableEntry+0x234
>                                        nt!ExSweepHandleTable+0x5f
>                                        nt!ObKillProcess+0x54
>                                        nt!PspExitThread+0x5b6
>                                        nt!PsExitSpecialApc+0x22
>                                        nt!KiDeliverApc+0x1dc
>                                        nt!KiServiceExit+0x56
>                                        ntdll!KiFastSystemCallRet
>                                        ntdll!ZwWaitForWorkViaWorkerFactory+0xc
>                                        ntdll!TppWorkerThread+0x1f6
>                                        kernel32!BaseThreadInitThunk+0xe
>                                        ntdll!__RtlUserThreadStart+0x23
>                                        ntdll!_RtlUserThreadStart+
>
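The ref_cnt / CL_DESTROYING state in the dump above can be illustrated with a small user-space model. This is only a sketch and not IBAL's code: `obj_model`, `obj_init`, `obj_deref` and `obj_destroy` are hypothetical names, and the idea that an in-flight send holds an extra reference on its AV is an assumption based on the description above. It shows how one outstanding reference on a child keeps the parent's count above zero, so a destroy that waits for zero would never finish.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a parent/child object tree (not the real al_obj). */
typedef struct obj_model {
    struct obj_model *parent;  /* NULL for a root object such as the PD */
    int ref_cnt;               /* creation reference plus any users */
} obj_model;

/* Create an object; a child takes one reference on its parent. */
static void obj_init(obj_model *o, obj_model *parent)
{
    o->parent = parent;
    o->ref_cnt = 1;            /* creation reference */
    if (parent != NULL)
        parent->ref_cnt++;
}

/* Drop one reference; the last one also releases the parent. */
static void obj_deref(obj_model *o)
{
    assert(o->ref_cnt > 0);
    if (--o->ref_cnt == 0 && o->parent != NULL)
        obj_deref(o->parent);
}

/* Drop the creation reference. Returns true if the object is fully
 * released; false means a blocking destroy would still be waiting. */
static bool obj_destroy(obj_model *o)
{
    obj_deref(o);
    return o->ref_cnt == 0;
}
```

In this model, destroying an AV that still has one user leaves it at ref_cnt 1, and destroying the PD afterwards also returns false until that last AV reference is dropped.
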
>winmad!WmRegRemoveHandler+0xae points here:
>
>	WmProviderDeregister(pRegistration->pProvider, pRegistration);
>	pRegistration->pDevice->IbInterface.destroy_qp(pRegistration->hQp, NULL);
>	pRegistration->pDevice->IbInterface.dealloc_pd(pRegistration->hPd, NULL);
>>	pRegistration->pDevice->IbInterface.close_ca(pRegistration->hCa, NULL);
>
>Could you suggest any ideas?
>Thank you.
>
>
>-----Original Message-----
>From: Leonid Keller
>Sent: Tuesday, January 31, 2012 1:15 PM
>To: 'Hefty, Sean'; Tzachi Dar; Smith, Stan
>Cc: Uri Habusha; ofw_list; Irena Gannon
>Subject: RE: Opensm & WinMad: a race, cauing BSOD722
>
>Thank you, Sean.
>
>Some comments.
>We do not think that this additional validation is necessary.
>It's hard to believe, unless you have seen it happen, that Windows can call close(handle) after open(&handle) has failed.
>
>As for the patch to winverbs: it causes a crash, because WvProviderGet is called at DISPATCH_LEVEL.
>
>ATTEMPTED_SWITCH_FROM_DPC (b8)
>A wait operation, attach process, or yield was attempted from a DPC routine.
>This is an illegal operation and the stack track will lead to the offending
>code and original DPC routine.
>
>nt!KiSwapContext+0x7f
>nt!KiSwapThread+0x2fa
>nt!KeWaitForGate+0x22a
>nt!KiAcquireGuardedMutex+0x35
>nt!KeAcquireGuardedMutex+0x39
>winverbs!WvProviderGet+0x1d
>winverbs!WvEpCompleteDisconnect+0x113
>winverbs!WvEpIbCmHandler+0x26a
>ibbus!cm_cep_handler+0x99
>ibbus!__process_cep+0x10f
>ibbus!__drep_handler+0x6ea
>ibbus!__cep_mad_recv_cb+0x246
>ibbus!__mad_svc_recv_done+0xb58
>ibbus!mad_disp_recv_done+0x1650
>ibbus!process_mad_recv+0x3bf
>ibbus!spl_qp_comp+0x3d2
>ibbus!spl_qp_recv_dpc_cb+0x112
>nt!KiRetireDpcList+0x117
>nt!KyRetireDpcList+0x5
>nt!KiDispatchInterruptContinue
>
>I've replaced the mutex with a spinlock - see below.
>I did the same for WinMad, although it has no asynchronous callbacks like WinVerbs does.
>The main reason is to keep it similar to WinVerbs as it is today.
>A minor, mostly theoretical reason: other functions use the provider mutex today. It seems worthwhile to me to keep
>the possibility for them to call a low-level WvProviderGet function.
>What's your opinion?
>
>Index: B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.c
>===================================================================
>--- B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.c	(revision 9686)
>+++ B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.c	(revision 9687)
>@@ -44,14 +44,15 @@
> LONG WvProviderGet(WV_PROVIDER *pProvider)
> {
> 	LONG val;
>+	KIRQL irql;
>
>-	KeAcquireGuardedMutex(&pProvider->Lock);
>+	KeAcquireSpinLock(&pProvider->SpinLock, &irql);
> 	val = InterlockedIncrement(&pProvider->Ref);
> 	if (val == 1) {
> 		pProvider->Ref = 0;
> 		val = 0;
> 	}
>-	KeReleaseGuardedMutex(&pProvider->Lock);
>+	KeReleaseSpinLock(&pProvider->SpinLock, irql);
> 	return val;
> }
>
>@@ -119,6 +120,7 @@
> 	KeInitializeEvent(&pProvider->SharedEvent, NotificationEvent, FALSE);
> 	pProvider->Exclusive = 0;
> 	KeInitializeEvent(&pProvider->ExclusiveEvent, SynchronizationEvent, FALSE);
>+	KeInitializeSpinLock(&pProvider->SpinLock);
> 	return STATUS_SUCCESS;
> }
>
>Index: B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.h
>===================================================================
>--- B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.h	(revision 9686)
>+++ B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.h	(revision 9687)
>@@ -80,6 +80,7 @@
> 	KEVENT			ExclusiveEvent;
>
> 	WORK_QUEUE		WorkQueue;
>+	KSPIN_LOCK		SpinLock;
>
> }	WV_PROVIDER;
>
>Index: B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.h
>===================================================================
>--- B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.h	(revision 9687)
>+++ B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.h	(revision 9688)
>@@ -57,6 +57,7 @@
> 	KEVENT				SharedEvent;
> 	LONG				Exclusive;
> 	KEVENT				ExclusiveEvent;
>+	KSPIN_LOCK			SpinLock;
>
> }	WM_PROVIDER;
>
>Index: B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.c
>===================================================================
>--- B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.c	(revision 9687)
>+++ B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.c	(revision 9688)
>@@ -36,14 +36,15 @@
> LONG WmProviderGet(WM_PROVIDER *pProvider)
> {
> 	LONG val;
>+	KIRQL irql;
>
>-	KeAcquireGuardedMutex(&pProvider->Lock);
>+	KeAcquireSpinLock(&pProvider->SpinLock, &irql);
> 	val = InterlockedIncrement(&pProvider->Ref);
> 	if (val == 1) {
> 		pProvider->Ref = 0;
> 		val = 0;
> 	}
>-	KeReleaseGuardedMutex(&pProvider->Lock);
>+	KeReleaseSpinLock(&pProvider->SpinLock, irql);
> 	return val;
> }
>
>@@ -72,6 +73,7 @@
> 	KeInitializeEvent(&pProvider->SharedEvent, NotificationEvent, FALSE);
> 	pProvider->Exclusive = 0;
> 	KeInitializeEvent(&pProvider->ExclusiveEvent, SynchronizationEvent, FALSE);
>+	KeInitializeSpinLock(&pProvider->SpinLock);
>
> 	ASSERT(ControlDevice != NULL);
>
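The patch above swaps a guarded mutex, which may only be acquired at IRQL <= APC_LEVEL, for a spinlock that is usable from the DPC path in the bugcheck stack. As an alternative illustration only, the same "take a reference unless the count is already zero" behaviour of WvProviderGet / WmProviderGet can be modelled without any lock using a compare-and-swap loop. The sketch below is user-space C11, not driver code: `provider_model`, `provider_get` and `provider_put` are hypothetical names, and it assumes nothing else in the drivers relies on the lock being held around the Ref update. In a kernel build, something like InterlockedCompareExchange would presumably play the role of the C11 atomic.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the Ref field of WV_PROVIDER / WM_PROVIDER. */
typedef struct {
    atomic_long ref;
} provider_model;

/* Take a reference unless the count is already zero.
 * Returns the new count, or 0 if the provider was already released. */
static long provider_get(provider_model *p)
{
    long cur = atomic_load(&p->ref);

    while (cur != 0) {
        /* On failure, cur is refreshed with the value currently stored. */
        if (atomic_compare_exchange_weak(&p->ref, &cur, cur + 1))
            return cur + 1;
    }
    return 0;
}

/* Drop a reference and return the new count. */
static long provider_put(provider_model *p)
{
    long prev = atomic_fetch_sub(&p->ref, 1);

    assert(prev > 0);
    return prev - 1;
}
```

Unlike the increment-then-undo form in the patch, the count here never leaves zero once it has reached it, so no lock is needed to hide a transient value from other callers.
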
>
>-----Original Message-----
>From: Hefty, Sean [mailto:sean.hefty at intel.com]
>Sent: Tuesday, January 31, 2012 12:08 AM
>To: Leonid Keller; Tzachi Dar; Smith, Stan
>Cc: Uri Habusha; ofw_list; Irena Gannon
>Subject: RE: Opensm & WinMad: a race, cauing BSOD722
>
>> Two ideas:
>> WmProviderInit() is called without checking the return status. Is there a
>> reason ?
>> Seems like the similar patch is needed for WvIoDeviceControl().
>
>I can't tell whether IOCTLs suffer from the same problem or not.  But since Windows is stupid, I went ahead and added the same protection
>to winverbs, plus some additional validation in case we get a cleanup event for a file for which we failed to create.
>
>
>
>
>- Sean


