[ofw] RE: HCA Soft Reset mechanism

Tue Aug 5 02:12:31 PDT 2008

Thank you, guys, for the remarks.
Please find my answers inline.

> -----Original Message-----
> From: Fab Tillier [mailto:ftillier at windows.microsoft.com] 
> Sent: Tuesday, August 05, 2008 4:07 AM
> To: Jan Bottorff; Sean Hefty; Leonid Keller; ofw at lists.openfabrics.org
> Subject: RE: [ofw] RE: HCA Soft Reset mechanism
> 
> > You guys don't seem to be thinking in terms of fabric booted systems.
> > You will not be able to remove the system disk driver instance or any
> > parent of the system disk, just because the hca is having a problem.
> > You will need to fail outstanding I/O's, reset/restart the hca, and
> > get everybody back up and running, so the system disk driver can retry
> > the I/O. This needs to happen with zero potential for page faults,
> > which means zero PnP events that might cause page faults.
> >
> > I've always assumed the goal of the OFW IB stack is to create servers
> > with ONLY an hca for I/O, although perhaps that's an incorrect
> > assumption.
> 
> Jan's right here.  I'd like to point out that the stack 
> already has the catastrophic error event reporting 
> capabilities.  There's no reason that the whole HCA should be 
> reset when IPoIB gets reset.  The HCA only needs to be reset 
> if it detects a catastrophic error.  

Wait-wait-wait ... :)
Clients are not expected to request a reset for fun.
Recall that they have to stop their own clients and release all their
resources before the reset, and restart everything afterwards ...
They request a reset upon *a HW malfunction or freeze* which WAS NOT
DETECTED by the catastrophic error polling thread!
In other words, it is an additional mechanism for recovering from HW
problems.

> Otherwise clients should 
> implement their own recovery logic that works independently 
> from all other clients' recovery logic.

Clients are not *obliged* to use the Soft Reset mechanism, or at least
not to start with it.
They can first try milder "medicines", like reinitializing stuck
QPs ...

> 
> This seems pretty straight forward to handle too - when an 
> HCA catastrophic error occurs, generate a error callback for 
> all QP/CQ resources.  

That requires the driver to keep bookkeeping information about all
QP/CQ resources, which is unreasonable and unnecessary IMHO.
Clients know their own resources and can release them upon reset
notification without problems.

> At this point, the old QP/CQ resources 
> are just objects maintained by the driver with no backing 
> hardware resource.  The HCA driver can then reset, 

Yes, but may it?
What if clients still have things to do?
How can the bus driver decide on its own that all clients are ready for
the reset?
For example, an Ethernet adapter driver has to wait until NDIS calls its
CheckForHang function before it can notify NDIS about the reset.
Otherwise NDIS will keep sending packets and expecting them to complete
in a timely manner ...

> and can re-create all PDs/MRs that existed before.  Clients then must 
> handle the error for their CQ/QPs to get back to functioning state.
> 
> Of course, it's probably a lot more complicated than this, 
> but it doesn't require this grand scheme 

It is not that grand, in my opinion.
Registration/deregistration of event callbacks was added earlier and
is used for propagating any events, e.g. PORT_UP/PORT_DOWN.
I've added only two new functions:
	mlx4_reset_request - client request to perform a reset;
	mlx4_reset_execute - client notification that it is ready for reset;
and two new events:
	IB_EVENT_RESET - "reset pending"
	IB_EVENT_RESET_END - "reset finished"
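
To make the intended handshake concrete, here is a minimal user-mode sketch in C. It is only an illustration under my own assumptions, not the actual mlx4 code: the two event names are the ones listed above, but the `sketch_*` functions, the `hca` and `client` structures and the registration helper are hypothetical stand-ins, and the real signatures of mlx4_reset_request/mlx4_reset_execute may well differ. The idea shown is that a reset request is announced to all registered clients, and the reset itself runs only after every client has reported that it is ready.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Event names are from the mail; everything else here is hypothetical. */
enum ib_event { IB_EVENT_RESET, IB_EVENT_RESET_END };

#define MAX_CLIENTS 8

typedef void (*event_cb)(enum ib_event ev, void *ctx);

struct client {
    event_cb cb;
    void *ctx;
    bool ready;               /* client has acknowledged the pending reset */
};

struct hca {
    struct client clients[MAX_CLIENTS];
    size_t nclients;
    bool reset_pending;
    unsigned resets_done;     /* stands in for the real HW reset */
};

/* Register an event callback; returns a client id, or -1 if the table is full. */
int hca_register(struct hca *h, event_cb cb, void *ctx)
{
    assert(cb != NULL);
    if (h->nclients == MAX_CLIENTS)
        return -1;
    h->clients[h->nclients] = (struct client){ cb, ctx, false };
    return (int)h->nclients++;
}

static void notify_all(struct hca *h, enum ib_event ev)
{
    for (size_t i = 0; i < h->nclients; i++)
        h->clients[i].cb(ev, h->clients[i].ctx);
}

/* Stand-in for mlx4_reset_request: announce "reset pending" to every client. */
void sketch_reset_request(struct hca *h)
{
    if (h->reset_pending)
        return;               /* a reset is already being coordinated */
    h->reset_pending = true;
    for (size_t i = 0; i < h->nclients; i++)
        h->clients[i].ready = false;
    notify_all(h, IB_EVENT_RESET);
}

/* Stand-in for mlx4_reset_execute: client `id` reports it is ready.
 * Returns 1 if this call completed the reset, 0 if others are still
 * busy, -1 if no reset is pending or the id is invalid. */
int sketch_reset_execute(struct hca *h, int id)
{
    if (!h->reset_pending || id < 0 || (size_t)id >= h->nclients)
        return -1;
    h->clients[id].ready = true;
    for (size_t i = 0; i < h->nclients; i++)
        if (!h->clients[i].ready)
            return 0;
    h->resets_done++;         /* the real driver would reset the HCA here */
    h->reset_pending = false;
    notify_all(h, IB_EVENT_RESET_END);
    return 1;
}

/* Example client: just counts the events it sees. */
struct seen { int reset; int reset_end; };

void on_event(enum ib_event ev, void *ctx)
{
    struct seen *s = ctx;
    if (ev == IB_EVENT_RESET)
        s->reset++;
    else
        s->reset_end++;
}
```

In this sketch an idle client still receives IB_EVENT_RESET, because notification goes through the registered callbacks and not through errors on its QPs/CQs, and the reset is held back until the last client calls the execute function.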

> of coordinating all clients

You have to coordinate them.
If a client *is not working* at that moment, it won't get a reset
indication in your scheme, which is incorrect.

> and doesn't require any changes to the usage model 
> from what is there today.

It is just a small two-way interface.
Today we already have a bigger one between HCA and IBAL.

> 
> -Fab
>