[openib-general] [PATCH/RFC] mthca: report catastrophic errors

Nishanth Aravamudan nacc at us.ibm.com
Mon Oct 24 14:26:58 PDT 2005


On 24.10.2005 [13:54:25 -0700], Roland Dreier wrote:
> I just committed the following patch, which adds some initial support
> for detecting and reporting catastrophic errors reported by Mellanox
> HCAs.  We start a periodic timer which polls the catastrophic error
> reporting buffer in device memory.  If an error is detected, we dump
> the contents of the buffer for post-mortem debugging, and report a
> fatal asynchronous error to higher levels.
> 
> In the future we can try to recover from these errors by resetting the
> device, but this will require some work in higher-level code as well.
> Let's get this in now, so that we at least get catastrophic errors
> reported in logs.
> 
> Comments and criticisms gratefully accepted.
> 
>  - R.
> 
> --- infiniband/hw/mthca/mthca_provider.c	(revision 3852)
> +++ infiniband/hw/mthca/mthca_provider.c	(working copy)

<snip>

> +void mthca_start_catas_poll(struct mthca_dev *dev)
> +{
> +	init_timer(&dev->catas_err.timer);
> +	dev->catas_err.stop = 0;
> +	dev->catas_err.map  = NULL;
> +
> +	if (!request_mem_region(dev->catas_err.addr,
> +				dev->catas_err.size * 4,
> +				DRV_NAME)) {
> +		mthca_warn(dev, "couldn't request catastrophic error region "
> +			   "at 0x%llx/0x%x\n",
> +			   (unsigned long long) dev->catas_err.addr,
> +			   dev->catas_err.size * 4);
> +		return;
> +	}
> +
> +	dev->catas_err.map = ioremap(dev->catas_err.addr, dev->catas_err.size * 4);
> +	if (!dev->catas_err.map) {
> +		mthca_warn(dev, "couldn't map catastrophic error region "
> +			   "at 0x%llx/0x%x\n",
> +			   (unsigned long long) dev->catas_err.addr,
> +			   dev->catas_err.size * 4);
> +		release_mem_region(dev->catas_err.addr,
> +				   dev->catas_err.size * 4);
> +		return;
> +	}
> +
> +	dev->catas_err.timer.data     = (unsigned long) dev;
> +	dev->catas_err.timer.function = poll_catas;
> +	dev->catas_err.timer.expires  = jiffies + MTHCA_CATAS_POLL_INTERVAL;

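I know akpm has only recently started harping on this (I have yet to
audit the whole kernel, but will get around to it eventually), but the
init_timer() call and the .data and .function assignments can now be
folded into a single setup_timer() call. Only .expires still needs to
be set separately.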
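
Roughly what I have in mind, as a userspace mock-up so it compiles
outside the kernel. The struct timer_list, init_timer() and
setup_timer() below are stand-ins that only mirror the shape of the
kernel ones, and struct catas_dev, start_catas_poll(), POLL_INTERVAL
and the jiffies value are hypothetical placeholders, not the driver's
actual definitions:

```c
/* Userspace mock of the timer fields the patch touches. */
struct timer_list {
	unsigned long expires;
	void (*function)(unsigned long);
	unsigned long data;
};

/* Mock: the real init_timer() prepares the timer's internal state. */
static void init_timer(struct timer_list *timer)
{
	(void) timer;
}

/* Same shape as the kernel helper: set function and data, then init. */
static void setup_timer(struct timer_list *timer,
			void (*function)(unsigned long),
			unsigned long data)
{
	timer->function = function;
	timer->data = data;
	init_timer(timer);
}

/* Placeholders for illustration only, not the driver's values. */
static unsigned long jiffies = 1000;
#define POLL_INTERVAL 500UL

/* Hypothetical, cut-down stand-in for the device structure. */
struct catas_dev {
	struct timer_list timer;
	int polls;
};

/* Timer callback: recovers the device from the opaque data word. */
static void poll_catas(unsigned long dev_ptr)
{
	struct catas_dev *dev = (struct catas_dev *) dev_ptr;

	dev->polls++;
}

static void start_catas_poll(struct catas_dev *dev)
{
	/* One call replaces init_timer() plus the .data/.function stores. */
	setup_timer(&dev->timer, poll_catas, (unsigned long) dev);

	/* setup_timer() does not touch .expires, so set it as before. */
	dev->timer.expires = jiffies + POLL_INTERVAL;
}
```

In the patch itself, that would mean dropping the init_timer() call at
the top of mthca_start_catas_poll() and doing the setup_timer() call
where the .data and .function assignments are now.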

Thanks,
Nish


