[ofw] BSOD at winmad
Alex Naslednikov
xalex at mellanox.co.il
Thu Nov 25 06:28:50 PST 2010
Sean,
I don't think that ib_sync_destroy() will help. Please see the scenario below (and correct me if I'm wrong):
1. WmRegRemoveHandler sets pRegistration->pDevice = NULL;
2. WmReceiveHandler() uses pReg->pDevice
3. The above callback was registered in WmRegInit():
    svc.mad_svc_context = pRegistration;
    svc.pfn_mad_send_cb = WmSendHandler;
    svc.pfn_mad_recv_cb = WmReceiveHandler;
    svc.support_unsol = WmConvertMethods(&svc, pAttributes);
    svc.mgmt_class = pAttributes->Class;
    svc.mgmt_version = pAttributes->Version;
    svc.svc_type = IB_MAD_SVC_DEFAULT;
    ib_status = dev->IbInterface.reg_mad_svc(pRegistration->hQp, &svc,
                                             &pRegistration->hService);
4. How can we ensure that this callback was removed before we cleared the pDevice pointer?
I.e., I am looking for something like a call to dereg_mad_svc.
5. Otherwise, such a callback can occur even after we have cleared the device pointer.
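To make the ordering concern in points 4 and 5 concrete, here is a minimal user-space sketch of one way to guarantee that no callback touches the device pointer after it is cleared. It is not winmad or IBAL code: registration_t, device_t, cb_enter, cb_exit, recv_handler and reg_remove are all hypothetical names, and the busy-wait stands in for whatever blocking primitive a real driver would use. The idea is that removal first marks the registration as closing, then waits for in-flight callbacks to drain, and only then clears the pointer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names come from winmad or IBAL. */
typedef struct { int id; } device_t;

typedef struct {
    device_t   *device;     /* plays the role of pRegistration->pDevice */
    atomic_int  in_flight;  /* number of callbacks currently running */
    atomic_bool closing;    /* set once removal has started */
} registration_t;

static void reg_init(registration_t *r, device_t *dev)
{
    assert(dev != NULL);
    r->device = dev;
    atomic_init(&r->in_flight, 0);
    atomic_init(&r->closing, false);
}

/* Called at the top of a callback. Returns false if removal has begun.
 * The counter is incremented BEFORE the flag is checked, so removal either
 * sees this callback in flight or this callback sees the closing flag. */
static bool cb_enter(registration_t *r)
{
    atomic_fetch_add(&r->in_flight, 1);
    if (atomic_load(&r->closing)) {
        atomic_fetch_sub(&r->in_flight, 1);
        return false;
    }
    return true;
}

static void cb_exit(registration_t *r)
{
    atomic_fetch_sub(&r->in_flight, 1);
}

/* Model of a receive callback: the device is only dereferenced between
 * a successful cb_enter() and the matching cb_exit(). */
static int recv_handler(registration_t *r)
{
    int id;

    if (!cb_enter(r))
        return -1;          /* registration is going away; drop the work */
    id = r->device->id;
    cb_exit(r);
    return id;
}

/* Model of the remove path: stop new callbacks, drain, then clear. */
static void reg_remove(registration_t *r)
{
    atomic_store(&r->closing, true);
    while (atomic_load(&r->in_flight) > 0)
        ;                   /* spin; a real driver would block on an event */
    r->device = NULL;
}
```

In this model a callback that arrives after reg_remove() has started returns early instead of dereferencing a NULL device, and reg_remove() cannot clear the pointer while a callback is still between cb_enter() and cb_exit().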
-----Original Message-----
From: Hefty, Sean [mailto:sean.hefty at intel.com]
Sent: Monday, November 22, 2010 7:39 PM
To: Alex Naslednikov
Cc: ofw at lists.openfabrics.org
Subject: RE: [ofw] BSOD at winmad
copying list on response
Okay - there is apparently an issue with winmad handling device removal (power exit) while there is an active user. (Everything in the stack has this sort of issue, btw.) I will need to look at the device removal code to see what the issue may be.
Winmad does the following during device removal:
void WmRegRemoveHandler(WM_REGISTRATION *pRegistration)
{
    ib_port_attr_mod_t port_cap;
    if (pRegistration->pDevice == NULL) {
        return;
    }
    if (pRegistration->PortCapMask) {
        RtlZeroMemory(&port_cap.cap, sizeof(port_cap.cap));
        pRegistration->pDevice->IbInterface.modify_ca(pRegistration->hCa,
                                                      pRegistration->PortNum,
                                                      pRegistration->PortCapMask,
                                                      &port_cap);
    }
    WmProviderDeregister(pRegistration->pProvider, pRegistration);
    pRegistration->pDevice->IbInterface.destroy_qp(pRegistration->hQp, NULL);
    pRegistration->pDevice->IbInterface.dealloc_pd(pRegistration->hPd, NULL);
    pRegistration->pDevice->IbInterface.close_ca(pRegistration->hCa, NULL);
    pRegistration->pDevice->IbInterface.close_al(pRegistration->hIbal);
    WmIbDevicePut(pRegistration->pDevice);
    pRegistration->pDevice = NULL;
}
The expectation was that after all of these calls return, no callbacks are in progress and no further callbacks will occur.
Can you try replacing the NULL parameters above in WmRegRemoveHandler with 'ib_sync_destroy'?
Note that I'm not completely convinced that the locking used during device removal is correct. But I would expect that to lead more to a deadlock condition than a blue screen.
> -----Original Message-----
> From: Alex Naslednikov [mailto:xalex at mellanox.co.il]
> Sent: Sunday, November 21, 2010 6:14 AM
> To: Hefty, Sean
> Subject: [ofw] BSOD at winmad
>
> Hello Sean,
>
> Recently, we got a BSOD in the winmad driver. I investigated the problem in more depth, and your
> comments are more than welcome.
>
>
>
> 1. Callstack:
>
> winmad!WmReceiveHandler+0x45 [s:\builds\6872\trunk\core\winmad\kernel\wm_provider.c @ 378]
>
> ibbus!__mad_svc_recv_done+0x9a9 [s:\builds\6872\trunk\core\al\al_mad.c @ 2217]
>
> ibbus!mad_disp_recv_done+0x11c6 [s:\builds\6872\trunk\core\al\al_mad.c @ 1016]
>
> ibbus!process_mad_recv+0x2f2 [s:\builds\6872\trunk\core\al\kernel\al_smi.c @ 2976]
>
> ibbus!spl_qp_comp+0x2a1 [s:\builds\6872\trunk\core\al\kernel\al_smi.c @ 2806]
>
> ibbus!spl_qp_recv_dpc_cb+0xcb [s:\builds\6872\trunk\core\al\kernel\al_smi.c @ 2674]
>
>
>
> 2. BSOD on null pointer:
>
> WdfObjectAcquireLock(prov->ReadQueue);
>
> if (reg->hService == NULL) {
>
> reg->pDevice->IbInterface.put_mad(pMad); <-- pDevice == NULL
>
> goto unlock;
>
> }
>
> 3. There are only 2 places where pDevice is set to NULL: the init error flow (WmRegInit) and
> destroy (WmRegRemoveHandler).
>
> I can suspect only the second case here, and in our case it happened because WmPowerD0Exit() was
> called.
>
> That is, WmPowerD0Exit()->WmProviderRemoveHandler()->WmRegRemoveHandler()
>
> 4. On the other hand, WmReceiveHandler still was not removed. Theoretically, this can be caused
> by:
>
> a. Not all WM callbacks were cleaned
>
> b. Reception of new MADs was stopped, but some MADs that were already being processed still ended up in
> WmReceiveHandler
>
> Alexander (XaleX) Naslednikov
>
> SW Networking Team
>
> Mellanox Technologies
>
>