[ofw] RE: HCA Soft Reset mechanism

Tue Aug 5 02:47:09 PDT 2008

See inline.

> -----Original Message-----
> From: Sean Hefty [mailto:sean.hefty at intel.com] 
> Sent: Monday, August 04, 2008 9:45 PM
> To: Leonid Keller; ofw at lists.openfabrics.org
> Subject: RE: HCA Soft Reset mechanism
> 
> >> From the perspective of a client, is the behavior any different than
> >> a remove device followed by an add device?
> >
> >No.
> 
> Maybe we can leverage the work in this area then...
> 
> >> How does this synchronize with a user unloading the driver or the 
> >> entire stack?
> >
> >A good point.
> >There is no special synchronization.
> >But there is also no synchronization between data transfer and PnP 
> >events like driver unload.
> 
> Synchronization between data transfers and driver unload is 
> handled by the stack from the top down.  By the time the 
> driver unload reaches the driver at the bottom of the stack, 
> everything above it will have stopped initiating data 
> transfers and released all resources.
> 
> The only special handling that ends up being needed are 
> drivers that export userspace interfaces.  Since userspace 
> can't be trusted to stop using the hardware, those drivers 
> need the functionality that you're describing.  They need to 
> block additional userspace access and cleanup any hardware 
> resources.  This is non-trivial to handle, and we want to 
> avoid putting this sort of complexity in every driver in the stack.
> 

I forgot to mention that all of this Soft Reset work is kernel-only.
User-mode clients are not notified (at this stage).

> Is there any way to issue standard PnP add/remove device, or 
> maybe power off/on, or a custom PnP event?  My concern is 
> that the upper level drivers have something easier to key off 
> of, rather than callbacks which could occur simultaneous with 
> other PnP events.

I'd say this is not what we want.
We want to resolve the HW problem as quietly as possible, without
dismantling the entire driver stack above.

> 
> >> Why do clients need to get new interfaces or re-register event 
> >> handlers?
> >
> >Because ib_device was re-created and it is the main parameter of the 
> >interface.
> 
> I think I'm missing something.  How do the drivers know when 
> to re-register, or when the hardware can be used again?

The scheme is as follows:
	- (optionally) a client calls 'mlx4_reset_request' to request a soft
	  reset;
	- the driver calls the event callback with IB_EVENT_RESET to start the
	  reset process;
	- the client prepares itself for the reset and calls
	  'mlx4_reset_execute' to proceed;
	- the driver performs the reset and calls the event callback with
	  IB_EVENT_RESET_END;
	- the client gets the new interface and resumes its work.
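
To make the ordering of the steps above concrete, here is a minimal
user-space sketch of the handshake in C. It is only an illustration: the
real signatures of 'mlx4_reset_request' and 'mlx4_reset_execute' are not
shown in this message, so every name prefixed with sketch_ is hypothetical,
and the "generation" counter merely stands in for the re-created ib_device.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for IB_EVENT_RESET / IB_EVENT_RESET_END. */
enum sketch_event { SKETCH_EVENT_RESET, SKETCH_EVENT_RESET_END };
enum sketch_state { SKETCH_RUNNING, SKETCH_RESET_PENDING };

struct sketch_device;
typedef void (*sketch_event_cb)(struct sketch_device *dev,
                                enum sketch_event ev, void *ctx);

struct sketch_device {
    enum sketch_state state;
    int generation;        /* bumped on each reset (assumed model of the
                              re-created ib_device) */
    sketch_event_cb cb;    /* the client's registered event callback */
    void *ctx;             /* opaque client context passed to cb */
};

struct sketch_client {
    int quiesced;          /* 1 while the client has stopped using the HW */
    int iface_generation;  /* generation of the interface the client holds */
};

/* Steps 1-2: a reset is requested; the driver notifies the client. */
int sketch_reset_request(struct sketch_device *dev)
{
    if (dev->state != SKETCH_RUNNING)
        return -1;                       /* a reset is already in flight */
    dev->state = SKETCH_RESET_PENDING;
    dev->cb(dev, SKETCH_EVENT_RESET, dev->ctx);
    return 0;
}

/* Steps 3-4: the client says "go"; the driver resets and reports the end. */
int sketch_reset_execute(struct sketch_device *dev)
{
    if (dev->state != SKETCH_RESET_PENDING)
        return -1;                       /* nothing to execute */
    dev->generation++;                   /* the "hardware reset" itself */
    dev->state = SKETCH_RUNNING;
    dev->cb(dev, SKETCH_EVENT_RESET_END, dev->ctx);
    return 0;
}

/* Client side: quiesce on RESET, pick up the new interface on RESET_END. */
void sketch_client_on_event(struct sketch_device *dev,
                            enum sketch_event ev, void *ctx)
{
    struct sketch_client *c = ctx;

    assert(c != NULL);
    if (ev == SKETCH_EVENT_RESET) {
        c->quiesced = 1;                 /* step 3: stop using the device */
    } else {
        c->iface_generation = dev->generation;  /* step 5: new interface */
        c->quiesced = 0;                 /* resume work */
    }
}
```

In this sketch the client calls sketch_reset_execute() after its callback
has returned rather than from inside it; whether the real driver allows the
call from within the callback is not stated here.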
> 
> >It is really a usual way to handle serious HW problems.
> >Say a network adapter driver detects that the card doesn't work in
> >the expected manner.
> >It notifies NDIS via the CheckForHang function that it wants to be
> >reset.
> >NDIS calls the network adapter's Reset function to initialize the HW
> >and solve the problem this way.
> 
> I agree with the functionality.  I'm just trying to figure 
> out the impact on the stack.  For mlx4, it seems like the bus 
> driver could remove the PDO, wait for the device removal to 
> complete, reset the adapter, then add back the PDO.  Maybe 
> there's more to it than this, but at least on the surface, 
> this seems easy and works without changes to the rest of the 
> stack.  I'm not sure how you'd handle mthca though, but maybe 
> getting this to work for mthca isn't as important...?
> 
> - Sean
> 
>