[ofa-general] Help with an MTHCA "catastrophe"

Tziporet Koren tziporet at dev.mellanox.co.il
Wed Apr 4 08:37:26 PDT 2007


> The following is console output immediately before a panic on a system
> running lustre with OFED 1.1.  How can I find out what it
> means?  
>
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: Catastrophic error detected: internal error
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[00]: 001d79f4
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[01]: 00000000
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[02]: 00198538
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[03]: 00136038
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[04]: 00207730
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[05]: 001d79cc
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[06]: 0023cf24
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[07]: 00000000
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[08]: 00000000
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[09]: 00000000
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[0a]: 00000000
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[0b]: 00000000
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[0c]: 00000000
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[0d]: 00000000
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[0e]: 00000000
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[0f]: 00000000
>
> ...shortly before it happens, the lustre/lnet OFED driver receives a
> number of what I believe to be duplicate SEND completion
> events.  It seems quite sporadic, and doesn't appear to track hardware.
>
>   
Please contact your HCA provider to get a FW version that fixes this issue.
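
When reporting this to the provider, it may help to send the buf[] words from the console log in a compact form. The following is only a minimal sketch, assuming the log lines look exactly like the ones quoted above; the function name extract_catas_buf and the regular expression are illustrative and are not part of OFED or the mthca driver. It does not interpret the words, it only collects them per PCI device.

```python
import re

# Assumed line format, taken from the console output quoted above:
#   "<date> <time> ib_mthca <pci-addr>:   buf[NN]: XXXXXXXX"
# Both the buffer index and the word are hexadecimal.
BUF_LINE = re.compile(
    r"ib_mthca\s+(\S+):\s+buf\[([0-9a-fA-F]{2})\]:\s+([0-9a-fA-F]{8})"
)


def extract_catas_buf(log_text):
    """Return {pci_device: {index: word}} for every buf[] line in log_text.

    Hypothetical helper: it only gathers the raw words so they can be
    forwarded to the HCA provider; it does not decode them.
    """
    dumps = {}
    for line in log_text.splitlines():
        match = BUF_LINE.search(line)
        if match:
            device = match.group(1)
            index = int(match.group(2), 16)
            word = int(match.group(3), 16)
            dumps.setdefault(device, {})[index] = word
    return dumps


def format_dump(words):
    """Render one device's words as 'idx:word' pairs in index order."""
    return " ".join(f"{i:02x}:{w:08x}" for i, w in sorted(words.items()))
```

For example, calling extract_catas_buf() on the saved console log and then format_dump() on the entry for 0000:07:00.0 gives a single line that can be pasted into a support request together with the current FW version.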

Tziporet


