[ofw] Opensm & WinMad: a race, causing BSOD722

Hefty, Sean sean.hefty at intel.com
Wed Jan 25 12:09:05 PST 2012


> We got a BSOD in Opensm - 10D, {b, 76157f00, 0, 8811d008}.
> 
> Could you take a look ?

I looked into this, and I can't say that I see anything wrong in the code.  :(
 
> Seems like BSOD has been caused by a  race between the main and MAD reading
> threads of Opensm.
> 
> The main thread has already closed the port and is now found in
> osm_subn_destroy():
> 
> opensm_main
>                 ...
>                 osm_mad_pool_destroy(&p_osm->mad_pool);
>                 osm_vendor_delete(&p_osm->p_vendor);    // port release
>                 osm_subn_destroy(&p_osm->subn);         // the thread is found here now
> 
> The reading thread is still in action:
> 
> opensm!umad_receiver
>                 libibumad!umad_recv
>                 ...
>                 winmad!WmIoRead
>                                 winmad!WmProviderRead
>                                                 WdfObjectAcquireLock(pProvider->ReadQueue);    // BSOD
> 
> An attempt to inspect ReadQueue with !wdfqueue fails.
> 
> Seems like pProvider is already released. But there are no checks of its
> validity in WmProviderRead().

The pProvider->Ref is set to 0, which strongly suggests that the provider has been released.
 
> Possible solution:
> 
> Maybe WmIoRead() should check that the Provider is not being released and
> take a reference, while WmProviderRemoveHandler() should wait for this
> reference to be removed?

The provider object is (supposed to be) bound to the lifetime of the open ControlDevice file.  It is initialized in the EvtFileCreate callback and released in the EvtFileCleanup callback.  According to the MS documentation, the EvtFileCleanup is called after the last handle to the file has been closed.  My assumption was that this meant that the file is no longer accessible for any other access (ioctls, reads, or writes).

There is a vague note in the documentation that states: "(Because of outstanding I/O requests, this handle might not have been released.)"  I have no idea what exactly this means.  If it means that Windows may invoke calls on a file during or after calling the EvtFileCleanup, then Windows is seriously stupid.

As a simple test, we can *try* adding checks in wm_driver.c in WmIoDeviceControl(), WmIoRead(), and WmIoWrite() that do something like:

	if (prov->Ref == 0) {
		WdfRequestComplete(Request, STATUS_WINDOWS_IS_STUPID);
		return;
	}
 
(A better solution may be to call WmProviderGet() / WmProviderPut(), with WmProviderGet() returning whether or not we actually obtained the provider.)  What we really need to determine is whether Windows will invoke calls on a file during or after calling the cleanup event callback, but I have no idea how we can know that.  And if it does, is that a 'feature' or a bug?  If Windows does not do this, then the check above isn't a safe fix, since it depends on the prov memory being accessible.
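
For illustration only, the get/put idea could look roughly like the sketch below.  It is plain C11 with stand-in names (provider_t, provider_init, provider_try_get, and provider_put are hypothetical, not the winmad functions), and it only shows a "get" that refuses to take a reference once the count has dropped to zero.  It has the same limitation as the check above: it still reads the provider memory, so it only helps if that memory is still valid.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the provider; not the real winmad structure. */
typedef struct provider {
    atomic_long ref;    /* 0 means the provider is being torn down */
} provider_t;

/* The creator holds the initial reference. */
static void provider_init(provider_t *p)
{
    atomic_init(&p->ref, 1);
}

/* Take a reference only if the count is still non-zero.
 * Returns false if teardown has already started. */
static bool provider_try_get(provider_t *p)
{
    long old = atomic_load(&p->ref);

    while (old != 0) {
        /* On failure the CAS reloads 'old' with the current count. */
        if (atomic_compare_exchange_weak(&p->ref, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference.  Returns true if the caller dropped the last one,
 * in which case the caller would be responsible for the final cleanup. */
static bool provider_put(provider_t *p)
{
    return atomic_fetch_sub(&p->ref, 1) == 1;
}
```

With something like this, the I/O callbacks would complete the request with an error when provider_try_get() fails, and call provider_put() once they are done with the provider.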

- Sean


