[ofw] RE: HCA Soft Reset mechanism
Leonid Keller
leonid at mellanox.co.il
Tue Aug 5 02:47:09 PDT 2008
See inline
> -----Original Message-----
> From: Sean Hefty [mailto:sean.hefty at intel.com]
> Sent: Monday, August 04, 2008 9:45 PM
> To: Leonid Keller; ofw at lists.openfabrics.org
> Subject: RE: HCA Soft Reset mechanism
>
> >> From the perspective of a client, is the behavior any different than
> >> a remove device followed by an add device?
> >
> >No.
>
> Maybe we can leverage the work in this area then...
>
> >> How does this synchronize with a user unloading the driver or the
> >> entire stack?
> >
> >A good point.
> >There is no special synchronization.
> >But there is also no synchronization between data transfer and PnP
> >events like driver unload.
>
> Synchronization between data transfers and driver unload is
> handled by the stack from the top down. By the time the
> driver unload reaches the driver at the bottom of the stack,
> everything above it will have stopped initiating data
> transfers and released all resources.
>
> The only special handling that ends up being needed are
> drivers that export userspace interfaces. Since userspace
> can't be trusted to stop using the hardware, those drivers
> need the functionality that you're describing. They need to
> block additional userspace access and clean up any hardware
> resources. This is non-trivial to handle, and we want to
> avoid putting this sort of complexity in every driver in the stack.
>
I forgot to mention that the whole Soft Reset mechanism is kernel-only.
User-mode clients are not notified (at this stage).
> Is there any way to issue standard PnP add/remove device, or
> maybe power off/on, or a custom PnP event? My concern is
> that the upper level drivers have something easier to key off
> of, rather than callbacks which could occur simultaneously with
> other PnP events.
I'd say this is not what we want.
We want to resolve the HW problem as quietly as possible, without
dismantling the whole driver stack above.
>
> >> Why do clients need to get new interfaces or re-register event
> >> handlers?
> >
> >Because ib_device was re-created and it is the main parameter of the
> >interface.
>
> I think I'm missing something. How do the drivers know when
> to re-register, or when the hardware can be used again?
The scheme is as follows:
- (optionally) the client calls 'mlx4_reset_request' to request a soft
reset;
- the driver calls the event callback with IB_EVENT_RESET to start the
reset process;
- the client prepares itself for the reset and calls
'mlx4_reset_execute' to proceed;
- the driver performs the reset and calls the event callback with
IB_EVENT_RESET_END;
- the client gets the new interface and resumes its work.
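
The handshake above can be sketched as a small user-space simulation.
This is only an illustration of the ordering, not driver code: every
sim_* name, the struct layout and the "generation" counter are
hypothetical, and the real signatures of 'mlx4_reset_request' and
'mlx4_reset_execute' are not shown in this thread. It also assumes a
single client that calls the execute step synchronously from its
callback, which a real client need not do.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-ins for IB_EVENT_RESET / IB_EVENT_RESET_END (values made up). */
enum sim_event { SIM_EVENT_RESET, SIM_EVENT_RESET_END };
enum sim_state { SIM_RUNNING, SIM_RESET_PENDING, SIM_RESETTING };

struct sim_device;
typedef void (*sim_event_cb)(struct sim_device *dev, enum sim_event ev);

struct sim_device {
    enum sim_state state;
    int generation;        /* bumped when the "ib_device" is re-created */
    sim_event_cb cb;       /* the client's event callback */
    int client_generation; /* generation of the interface the client holds */
};

/* Hypothetical stand-in for mlx4_reset_execute: the client is ready. */
static int sim_reset_execute(struct sim_device *dev)
{
    if (dev->state != SIM_RESET_PENDING)
        return -1;                      /* no reset was announced */
    dev->state = SIM_RESETTING;
    dev->generation++;                  /* device object is re-created */
    dev->state = SIM_RUNNING;
    dev->cb(dev, SIM_EVENT_RESET_END);  /* tell the client it is over */
    return 0;
}

/* Hypothetical stand-in for mlx4_reset_request: start the process. */
static int sim_reset_request(struct sim_device *dev)
{
    if (dev->state != SIM_RUNNING)
        return -1;                      /* a reset is already in flight */
    dev->state = SIM_RESET_PENDING;
    dev->cb(dev, SIM_EVENT_RESET);      /* ask the client to prepare */
    return 0;
}

/* Client side: quiesce on RESET, re-acquire the interface on RESET_END. */
static void sim_client_cb(struct sim_device *dev, enum sim_event ev)
{
    if (ev == SIM_EVENT_RESET) {
        /* ...stop posting work and drop old references here... */
        int rc = sim_reset_execute(dev);
        assert(rc == 0);
        (void)rc;
    } else {
        dev->client_generation = dev->generation;
    }
}

/* Runs one full reset cycle; returns 0 if the client ends up current. */
int sim_run_demo(void)
{
    struct sim_device dev = { SIM_RUNNING, 1, sim_client_cb, 1 };

    if (sim_reset_request(&dev) != 0)
        return -1;
    printf("device generation %d, client generation %d\n",
           dev.generation, dev.client_generation);
    return (dev.state == SIM_RUNNING &&
            dev.client_generation == dev.generation) ? 0 : -1;
}
```

The generation counter is there only to make the last step visible:
after IB_EVENT_RESET_END the old interface refers to a device object
that no longer exists, so the client has to fetch the new one.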
>
> >It is really a usual way to handle serious HW problems.
> >Say a network adapter discovers that the card doesn't work in the
> >expected manner.
> >It notifies NDIS via the CheckForHang function that it wants to be
> >reset.
> >NDIS calls the network adapter's Reset function to initialize the HW
> >and solve the problem this way.
>
> I agree with the functionality. I'm just trying to figure
> out the impact on the stack. For mlx4, it seems like the bus
> driver could remove the PDO, wait for the device removal to
> complete, reset the adapter, then add back the PDO. Maybe
> there's more to it than this, but at least on the surface,
> this seems easy and works without changes to the rest of the
> stack. I'm not sure how you'd handle mthca though, but maybe
> getting this to work for mthca isn't as important...?
>
> - Sean
>
>