[Openib-windows] System reboot testing

Thu Apr 27 07:45:37 PDT 2006

Hi Jan,

On 4/27/06, Jan Bottorff <jbottorff at xsigo.com> wrote:
>
> Hi,
>
> I was doing some testing on build 330, and see the assert on system shutdown
> is still there. Would one of you folks doing development like my script that
> reboots a remote machine every 7 minutes? This was a 32-bit debug build
> running on W2K3 SP1 full checked. It offhand looked like the IB stack
> received a power shutdown irp while there were still open references to
> objects. It might be an ordering issue of delivery of power irps. It seems
> to take 20 reboots on my test system to hit this, which is a bit better than
> in the older release we are using (which I've never seen reboot more than
> about 10 times without getting stuck).
>
> I'm told by Microsoft to be prepared for drivers in different PnP
> hierarchies to at times be shut down in parallel in multiple threads and at
> times to only have a single system wide thread doing the shutdown. I believe
> there is code in the IB drivers that causes the power irp thread to block if
> there are outstanding references, on the assumption some other thread will
> release those references. In the case of only a single thread, or the case
> the IB driver happens to be assigned to the same thread as the driver that
> has the references, the strategy of waiting for the references to go away
> will not be effective. Setting a system to single processor mode might help
> reproduce this as may the checked OS. I suspect the real root cause is the
> IBAL driver being root enumerated means there is no shutdown power irp
> ordering between it and the HCA driver stack.

Power management in general is pretty broken right now.  However, I
believe the MTHCA driver for Windows handles it better (though still
not quite right).

The problem with the HCA is that it is stateless.  If you power it
down, all resources that were allocated disappear.  The HCA driver
needs a lot of changes so that it restores at least some context,
letting users continue to use the resources they allocated before the
HCA was moved out of D0.

The way it's handled right now is that any request to move the HCA out
of D0 is treated just as if the HCA were being removed from the
system.  To make things worse, the MT23108 driver handles power
management requests in the context of the power IRP, and makes a
blocking call.  This in itself is a flaw, since power IRPs can come
down at DISPATCH_LEVEL if any device has a special file (page,
hibernation, or crashdump), and a thread cannot block at that IRQL.

The Windows MTHCA driver queues an I/O work item to process the power
management request.  Could you give that driver a shot and see if it
still exhibits the same issues?  We'll be transitioning away from the
MT23108 driver shortly.
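
To illustrate the difference, here is a rough user-mode model of the
"queue a work item instead of blocking in the power IRP" pattern.  It
is only a sketch, not code from either driver: every name in it
(power_request_t, power_dispatch, run_worker, blocking_teardown, and
so on) is made up for illustration, and the "level" variable merely
stands in for the IRQL.  In a real WDM driver the hand-off would
presumably go through IoAllocateWorkItem/IoQueueWorkItem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* User-mode model only. Every name below is hypothetical. */

typedef enum { LEVEL_PASSIVE, LEVEL_DISPATCH } level_t;

typedef struct {
    int  target_state;  /* 0 for D0, 3 for D3 */
    bool pending;       /* handed off to the worker, not finished yet */
    bool completed;     /* the blocking work has been done */
} power_request_t;

#define QUEUE_CAP 8

power_request_t *work_queue[QUEUE_CAP];
size_t  queue_head = 0;
size_t  queue_count = 0;
level_t current_level = LEVEL_PASSIVE;
int     blocked_at_dispatch = 0;   /* counts illegal blocking calls */

/* Stand-in for the teardown that has to wait on outstanding references. */
void blocking_teardown(power_request_t *req)
{
    if (current_level != LEVEL_PASSIVE)
        blocked_at_dispatch++;     /* would be a bug in a real driver */
    req->pending = false;
    req->completed = true;
}

/*
 * Power handler: may be entered at the elevated level, so it never
 * blocks. It only marks the request pending and queues it.
 * Returns false if the queue is full.
 */
bool power_dispatch(power_request_t *req)
{
    if (queue_count == QUEUE_CAP)
        return false;
    work_queue[(queue_head + queue_count) % QUEUE_CAP] = req;
    queue_count++;
    req->pending = true;
    return true;
}

/*
 * Worker: runs later at passive level, where blocking is allowed.
 * Drains the queue and returns the number of requests processed.
 */
size_t run_worker(void)
{
    level_t saved = current_level;
    size_t done = 0;

    current_level = LEVEL_PASSIVE;
    while (queue_count > 0) {
        power_request_t *req = work_queue[queue_head];
        assert(req != NULL);
        queue_head = (queue_head + 1) % QUEUE_CAP;
        queue_count--;
        blocking_teardown(req);
        done++;
    }
    current_level = saved;
    return done;
}
```

In this model, calling power_dispatch() while current_level is
LEVEL_DISPATCH just leaves the request pending.  The blocking part only
runs once run_worker() is called, so blocked_at_dispatch stays at zero.
Calling blocking_teardown() directly from the handler would bump that
counter, which is roughly the MT23108 situation described above.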

Thanks,

- Fab