[Openib-windows] Crash on driver unload

Fabian Tillier ftillier at silverstorm.com
Tue May 23 07:58:24 PDT 2006


Hi Leo,

On 5/23/06, Leonid Keller <leonid at mellanox.co.il> wrote:
> Hi, Fab.
> I've got a blue screen on post_send operation of a test.
> The direct reason was the fact that the driver was at that moment found
> at the mthca_cmd_cleanup() line of mthca_remove_one()function, waiting
> for a command and having released almost all the resources.
> mthca_remove_one() was called from __PowerDownCb().
>
> This rises two questions:
>
> 1. (a minor one) Why could driver receive a SET_POWER request ?
> The computer was not being powered down. The only action, related to the
> crash - and I'm not sure that it was really done - was disabling of
> ibbus.sys driver.
> Could it somehow cause sending of a SET_POWER request ?

I don't know.  If the HCA was being disabled this might make sense.

> 2. (a major one) Why do we have the crash ?
> Look, please, into __PowerDownCb() (the same happens also in
> hca_release_resources()):
> I, first, call p_ext->ci_ifc.deregister_ca() and then immediately start
> a synchronous releasing of the driver resources.
> But as far as I saw, the call to deregister_ca() just starts an
> asynchronous process of closing CA instance, which doesn't prevent  - at
> least for some time - the work of kernel clients.

deregister_ca blocks until all clients have closed.  The CI CA object
is created as a sync destroy object, so it won't complete destruction
until all references have been released.  As long as there are open
handles to the CA, the CI CA object should not be able to go away.

Is it possible the client is closing its CA handle from the PnP
callback, but continuing to use the QP handle?  Does the test register
for PnP callbacks?  How does it handle them?

> I also saw a following comment in the ib_deregister_ca().
>         /* TODO: Before destroying, do a query PnP call and return
> IB_BUSY as needed. */

It would be nice if we gave users the ability to veto a CA removal,
similar to how QUERY_REMOVE IRPs are handled.  As it stands, if a user
gets a CA_REMOVE event, they *must* close the CA.  However, adding
support for IB_BUSY in the CA removal requires more work, since a
REMOVE IRP cannot be failed.  So we'd need to forward the QUERY_REMOVE
IRP, and I'd rather not invent new mechanisms for doing this when the
OS provides these PnP events to both kernel and user clients.

> We need to do here several things, IMO:
>        - deregister_ca() has, before starting the closing of CA, to set
> some flag, preventing any further work with the driver;
>        - this flag must be checked in all the verbs;

Adding a flag doesn't fix the problem.  If you want to handle users
making calls with invalid handles, you need to do a proper lookup of
the handle (like is done for user-mode).  The lookup would then
validate the handle, take a reference on its underlying object,
perform the requested operation, and release the reference.  Doing
such a lookup requires a global handle table and a global lock, both
of which should be avoided.

- Fab



More information about the ofw mailing list