[ofw] RE: System shutdown failure - living on the edge

Fri Apr 17 10:12:16 PDT 2009

Hello Fab,

Ah, sins of the past....Filter driver conversion.

At this time I'm really only concerned about correct shutdown behavior [No asserts or traps into windbg].

I did a little experimentation w.r.t. treating System Shutdown the same as HCA removal/disable.
With a little refactoring fd_remove_resources() now calls
bus_remove_resources(p_dev_obj,shutdown).
When 'shutdown' is TRUE, cl_thread_suspend() is skipped, as it's not good to pause/sleep in the power path.
When it is FALSE, cl_thread_suspend() is called after al_cleanup() to give the AL async threads more of a chance to run and realize the HCA has gone away; this speeds up the devmgmt.msc device view refresh.

fdo_set_power() on [SystemPowerState && ShutDownType == PowerActionShutdown*] calls bus_remove_resources(p_dev_obj,TRUE).
The AL async threads no longer attempt to send on an HCA which is shutting down.
System shuts down/restarts gracefully with no problems (+/- WSD).
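
A rough sketch of the flow described above, as a standalone user-mode mock rather than the driver source. Only the function names come from the text; the dev_obj_t type, the power_action_t enum (loosely mirroring the PowerActionShutdown* values), the counters, the stub bodies and the pause length are all assumptions made for illustration.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the real device object and power action. */
typedef struct dev_obj { int id; } dev_obj_t;

typedef enum {
    ACTION_NONE,
    ACTION_SHUTDOWN,
    ACTION_SHUTDOWN_RESET,
    ACTION_SHUTDOWN_OFF
} power_action_t;

/* Counters so the mock can show which calls were made. */
static int g_al_cleanup_calls;
static int g_suspend_calls;

/* Stubs: the real routines live in the driver and complib. */
static void al_cleanup(void) { g_al_cleanup_calls++; }
static void cl_thread_suspend(unsigned pause_ms) { (void)pause_ms; g_suspend_calls++; }

/* shutdown == true: never pause, since this runs in the power path.
 * shutdown == false: pause after al_cleanup() so the AL async threads
 * get a chance to notice the HCA has gone away. */
static void bus_remove_resources(dev_obj_t *p_dev_obj, bool shutdown)
{
    (void)p_dev_obj;
    al_cleanup();
    if (!shutdown)
        cl_thread_suspend(100); /* arbitrary pause length, assumed */
}

/* HCA removal/disable path. */
static void fd_remove_resources(dev_obj_t *p_dev_obj)
{
    bus_remove_resources(p_dev_obj, false);
}

/* System power path: tear down only when the system is shutting down. */
static void fdo_set_power(dev_obj_t *p_dev_obj, bool is_system_power,
                          power_action_t action)
{
    bool shutting_down = (action == ACTION_SHUTDOWN ||
                          action == ACTION_SHUTDOWN_RESET ||
                          action == ACTION_SHUTDOWN_OFF);

    if (is_system_power && shutting_down)
        bus_remove_resources(p_dev_obj, true);
    /* The real driver would pass the power IRP down the stack here. */
}
```

The point of the single boolean is that both paths share one teardown routine, and only the removal path is allowed to sleep.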

Further experimentation is needed to fully understand whether the cl_thread_suspend() is truly needed in the fdo_release_resources() path. In bus_release_resources() the HCA reference release is delayed until after the al_cleanup() for the last standing HCA, on the belief that al_cleanup() wants the HCA to stick around until it's finished. This ordering may already provide the required serialization, in which case cl_thread_suspend() is not required.
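
The ordering in question can be pictured with a small mock. This is an illustration only, not the driver code: the hca_t type, the HCA counter, the event log and the hca_deref() helper are hypothetical, and only bus_release_resources() and al_cleanup() are names taken from the text.

```c
/* Hypothetical HCA object carrying a simple reference count. */
typedef struct hca { int ref_cnt; } hca_t;

enum { EV_AL_CLEANUP = 1, EV_HCA_DEREF = 2 };

static int g_events[8];   /* order in which the steps ran */
static int g_num_events;
static int g_hca_count;   /* number of HCAs still present */

static void record(int ev)
{
    if (g_num_events < 8)
        g_events[g_num_events++] = ev;
}

/* Stub for the real AL teardown. */
static void al_cleanup(void) { record(EV_AL_CLEANUP); }

/* Hypothetical helper that drops the bus driver's HCA reference. */
static void hca_deref(hca_t *p_hca)
{
    p_hca->ref_cnt--;
    record(EV_HCA_DEREF);
}

/* For the last standing HCA, al_cleanup() runs first and the HCA
 * reference is dropped only afterwards, so the HCA outlives the
 * AL teardown. */
static void bus_release_resources(hca_t *p_hca)
{
    if (--g_hca_count == 0)
        al_cleanup();
    hca_deref(p_hca);
}
```

If al_cleanup() does not return until the AL async threads are done with the HCA, this ordering alone would serialize the teardown; whether it does is exactly what still needs checking.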

Film @ eleven....

Stan.

Fab Tillier wrote:
>> What suggestions might you provide to repair this situation such that
>> the IBAL async threads are shutdown prior to passing the POWER IRP
>> down to the HCA driver?
>
> What used to happen before ibbus became a filter driver is that the
> HCA driver would treat powering off (moving out of D0) the same way
> as HCA removal.  At this point, it's your only/best bet.
>
> Longer term, it would be great if the HCA driver kept state and
> restored what it could when it powers back up.  Much of the state can
> be restored - all resource can be re-allocated, all pending work
> requests can be completed, all completed CQEs can be reaped, etc.
> What you can't restore are connections for RC and UD QPs, and for
> those the HCA driver could generate a QP fatal error, or flush all
> posted work requests.
>
> This would require significant rework of the HCA driver, so it's not
> an easy fix, but it would allow for things to not have to be fully
> torn down/recreated.
>
> -Fab