[ofw] BSOD at winmad

Hefty, Sean sean.hefty at intel.com
Mon Nov 22 09:39:19 PST 2010


copying list on response

Okay - there is apparently an issue with winmad handling device removal (power exit) while there is an active user.  (Everything in the stack has this sort of issue, btw.)  I will need to look at the device removal code to see what the issue may be.

Winmad does the following during device removal:

void WmRegRemoveHandler(WM_REGISTRATION *pRegistration)
{
	ib_port_attr_mod_t	port_cap;

	if (pRegistration->pDevice == NULL) {
		return;
	}

	if (pRegistration->PortCapMask) {
		RtlZeroMemory(&port_cap.cap, sizeof(port_cap.cap));
		pRegistration->pDevice->IbInterface.modify_ca(pRegistration->hCa,
													  pRegistration->PortNum,
													  pRegistration->PortCapMask,
													  &port_cap);
	}

	WmProviderDeregister(pRegistration->pProvider, pRegistration);
	pRegistration->pDevice->IbInterface.destroy_qp(pRegistration->hQp, NULL);
	pRegistration->pDevice->IbInterface.dealloc_pd(pRegistration->hPd, NULL);
	pRegistration->pDevice->IbInterface.close_ca(pRegistration->hCa, NULL);
	pRegistration->pDevice->IbInterface.close_al(pRegistration->hIbal);

	WmIbDevicePut(pRegistration->pDevice);
	pRegistration->pDevice = NULL;
}

The expectation was that after all of these calls return, no callbacks are in progress and no further callbacks will occur.
 
Can you try replacing the NULL parameters above in WmRegRemoveHandler with 'ib_sync_destroy'?
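
To illustrate why the destroy-callback argument matters, here is a small stand-alone model, not the IBAL code. It assumes (as the suggestion above implies) that a NULL callback lets the destroy call return before teardown has finished, while a synchronous sentinel makes the call wait until no callbacks remain. The names mock_qp_t, mock_destroy_qp and SYNC_DESTROY are hypothetical stand-ins, not IBAL definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration only; not the IBAL types. */
typedef void (*destroy_cb_t)(void *context);

/* Sentinel meaning "block until destruction has completed" (assumed semantics). */
#define SYNC_DESTROY ((destroy_cb_t)(intptr_t)-1)

typedef struct {
    int callbacks_in_flight;   /* receive callbacks still running or queued */
    int destroyed;             /* set once destruction has been requested */
} mock_qp_t;

/* Model one outstanding callback running to completion. */
static void mock_run_pending_callback(mock_qp_t *qp)
{
    if (qp->callbacks_in_flight > 0)
        qp->callbacks_in_flight--;
}

/*
 * Request destruction of the mock QP.
 * Returns how many callbacks may still fire after this call has returned.
 */
static int mock_destroy_qp(mock_qp_t *qp, destroy_cb_t cb)
{
    assert(qp != NULL);
    qp->destroyed = 1;

    if (cb == SYNC_DESTROY) {
        /* Synchronous: drain every outstanding callback before returning. */
        while (qp->callbacks_in_flight > 0)
            mock_run_pending_callback(qp);
    }
    /* cb == NULL: asynchronous, return immediately without waiting. */
    return qp->callbacks_in_flight;
}
```

In this model, a caller that passes NULL and then clears its own state (as WmRegRemoveHandler does with pDevice) can still be reached by a late callback; with the synchronous sentinel the count is zero by the time the call returns.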

Note that I'm not completely convinced that the locking used during device removal is correct.  But I would expect that to lead more to a deadlock condition than a blue screen.
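
For reference, the general shape of guarding the receive path against a concurrent removal can be sketched as below. This is only a simplified model under stated assumptions: it uses a pthread mutex in place of the WDF object lock, and mock_registration_t, mock_remove and mock_receive are hypothetical names, not winmad code. It only shows how a NULL check under the same lock avoids the dereference; it does not settle whether the real removal path takes the right locks, and in the real driver a dropped MAD would still have to be returned to the access layer somehow.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; not the winmad structures. */
typedef struct {
    int mads_returned;         /* counts MADs handed back to the device */
} mock_device_t;

typedef struct {
    pthread_mutex_t lock;      /* stands in for the queue/object lock */
    mock_device_t  *device;    /* NULL once removal has run */
} mock_registration_t;

/* Removal path: detach the device while holding the lock. */
static void mock_remove(mock_registration_t *reg)
{
    pthread_mutex_lock(&reg->lock);
    reg->device = NULL;
    pthread_mutex_unlock(&reg->lock);
}

/*
 * Receive path: returns 1 if the MAD was handed back to the device,
 * 0 if the device was already gone and the MAD was dropped.
 */
static int mock_receive(mock_registration_t *reg)
{
    int handled = 0;

    pthread_mutex_lock(&reg->lock);
    if (reg->device != NULL) {
        /* Device still attached: safe to use it while the lock is held. */
        reg->device->mads_returned++;
        handled = 1;
    }
    pthread_mutex_unlock(&reg->lock);
    return handled;
}
```

A guard like this would hide the symptom rather than fix the cause if callbacks are supposed to have stopped before removal completes, so it is best read as a diagnostic aid.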

> -----Original Message-----
> From: Alex Naslednikov [mailto:xalex at mellanox.co.il]
> Sent: Sunday, November 21, 2010 6:14 AM
> To: Hefty, Sean
> Subject: [ofw] BSOD at winmad
> 
> Hello Sean,
> 
> Recently, we got a BSOD in the winmad driver. I investigated the problem in more depth, and your
> comments are more than welcome.
> 
> 
> 
> 1.       Callstack:
> 
> winmad!WmReceiveHandler+0x45 [s:\builds\6872\trunk\core\winmad\kernel\wm_provider.c @ 378]
> 
> ibbus!__mad_svc_recv_done+0x9a9 [s:\builds\6872\trunk\core\al\al_mad.c @ 2217]
> 
> ibbus!mad_disp_recv_done+0x11c6 [s:\builds\6872\trunk\core\al\al_mad.c @ 1016]
> 
> ibbus!process_mad_recv+0x2f2 [s:\builds\6872\trunk\core\al\kernel\al_smi.c @ 2976]
> 
> ibbus!spl_qp_comp+0x2a1 [s:\builds\6872\trunk\core\al\kernel\al_smi.c @ 2806]
> 
> ibbus!spl_qp_recv_dpc_cb+0xcb [s:\builds\6872\trunk\core\al\kernel\al_smi.c @ 2674]
> 
> 
> 
> 2.       BSOD on null pointer:
> 
> WdfObjectAcquireLock(prov->ReadQueue);
> 
>                 if (reg->hService == NULL) {
> 
>                                 reg->pDevice->IbInterface.put_mad(pMad);  <-- pDevice == NULL
> 
>                                 goto unlock;
> 
>                 }
> 
> 3.       There are only 2 places where pDevice is set to NULL: the init error flow (WmRegInit) and
> destroy (WmRegRemoveHandler).
> 
> I can suspect only the second case here, and in our case it happened because WmPowerD0Exit() was
> called.
> 
> That is, WmPowerD0Exit()->WmProviderRemoveHandler()->WmRegRemoveHandler()
> 
> 4.       On the other hand, WmReceiveHandler had still not been removed. Theoretically, this can be caused
> by:
> 
> a.       Not all WM callbacks were cleaned up
> 
> b.      Receiving of new MADs was stopped, but some MADs that were already being processed still ended up in
> WmReceiveHandler
> 
> 
> 
> 
> 
> 
> 
> Alexander (XaleX) Naslednikov
> 
> SW Networking Team
> 
> Mellanox Technologies
> 
> 



