[ofw] RE: HCA Soft Reset mechanism

Mon Aug 4 18:07:11 PDT 2008

>  You guys don't seem to be thinking in terms of fabric booted systems.
> You will not be able to remove the system disk driver instance or any
> parent of the system disk, just because the hca is having a problem. You
> will need to fail outstanding I/O's, reset/restart the hca, and get
> everybody back up and running, so the system disk driver can retry the
> I/O. This needs to happen with zero potential for page faults, which
> means zero PnP events that might cause page faults.
>
> I've always assumed the goal of the OFW IB stack is to create servers
> with ONLY an hca for I/O, although perhaps that's an incorrect
> assumption.

Jan's right here.  I'd like to point out that the stack already has catastrophic error event reporting capabilities.  There's no reason the whole HCA should be reset when IPoIB gets reset.  The HCA only needs to be reset if it detects a catastrophic error.  Otherwise, clients should implement their own recovery logic that works independently of all other clients' recovery logic.

This seems pretty straightforward to handle too - when an HCA catastrophic error occurs, generate an error callback for all QP/CQ resources.  At this point, the old QP/CQ resources are just objects maintained by the driver with no backing hardware resource.  The HCA driver can then reset, and can re-create all PDs/MRs that existed before.  Clients then must handle the error for their QPs/CQs to get back to a functioning state.
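
To make that flow a bit more concrete, here's a rough sketch in C.  None of these names (hca_t, resource_t, hca_create, hca_handle_catastrophic_error, client_on_error, client_recover) come from the actual stack; they are made up for illustration, and the "hardware" is just a flag on each object.  It only shows the ordering: mark everything as lost, notify QP/CQ owners, reset, re-create PDs/MRs, and leave each client to recover its own QPs/CQs later.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_RES 8

typedef enum { RES_PD, RES_MR, RES_QP, RES_CQ } res_type_t;

typedef struct resource resource_t;
typedef void (*error_cb_t)(resource_t *res, void *client_ctx);

struct resource {
    res_type_t type;
    bool       hw_backed;   /* false once the backing hardware object is gone */
    error_cb_t on_error;    /* client's error callback (used for QP/CQ only) */
    void      *client_ctx;
};

/* Hypothetical HCA driver state: a flat table of driver-side objects. */
typedef struct {
    resource_t res[MAX_RES];
    size_t     count;
    unsigned   reset_count;
} hca_t;

resource_t *hca_create(hca_t *hca, res_type_t type, error_cb_t cb, void *ctx)
{
    assert(hca != NULL);
    if (hca->count == MAX_RES)
        return NULL;
    resource_t *r = &hca->res[hca->count++];
    r->type = type;
    r->hw_backed = true;
    r->on_error = cb;
    r->client_ctx = ctx;
    return r;
}

void hca_handle_catastrophic_error(hca_t *hca)
{
    assert(hca != NULL);

    /* 1. All hardware state is gone; the driver objects remain. */
    for (size_t i = 0; i < hca->count; i++)
        hca->res[i].hw_backed = false;

    /* 2. Report the error to the owner of every QP/CQ. */
    for (size_t i = 0; i < hca->count; i++) {
        resource_t *r = &hca->res[i];
        if ((r->type == RES_QP || r->type == RES_CQ) && r->on_error)
            r->on_error(r, r->client_ctx);
    }

    /* 3. Reset the device (simulated here by a counter). */
    hca->reset_count++;

    /* 4. The driver re-creates PDs/MRs on its own; QPs/CQs are left
     *    to their owning clients. */
    for (size_t i = 0; i < hca->count; i++) {
        resource_t *r = &hca->res[i];
        if (r->type == RES_PD || r->type == RES_MR)
            r->hw_backed = true;
    }
}

/* Hypothetical client: just records that it saw an error. */
typedef struct { int errors_seen; } client_t;

void client_on_error(resource_t *res, void *ctx)
{
    (void)res;
    ((client_t *)ctx)->errors_seen++;
}

/* Each client recovers its own QP/CQ, independently of other clients.
 * Returns true if the resource needed recovery. */
bool client_recover(resource_t *res)
{
    if (res->hw_backed)
        return false;
    res->hw_backed = true;   /* stands in for re-creating the QP/CQ */
    return true;
}
```

Note the callback only records the error; the client does its actual recovery afterwards, once the reset has completed, and no other client is involved.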

Of course, it's probably a lot more complicated than this, but it doesn't require this grand scheme of coordinating all clients and doesn't require any changes to the usage model from what is there today.

-Fab