[openib-general] Catastrophic error detected.

Dotan Barak dotanb at dev.mellanox.co.il
Thu Oct 19 08:26:17 PDT 2006


Hi Ira.

Ira Weiny wrote:
> I got the following error running with OFED 1.1 on a modified 2.6.9 RHEL4
> kernel.  Hal mentioned that there might be a catastrophic error recovery patch
> submitted since then?  I can't find a mention of that in the mailing list.  If
> possible I would like to try such a patch.
>
> Thanks,
> Ira
>
> 2006-10-17 21:31:47 ib_mthca 0000:07:00.0: Catastrophic error detected: unknown error
> 2006-10-17 21:31:47 ib_mthca 0000:07:00.0:   buf[00]: ffffffff
> 2006-10-17 21:31:47 ib_mthca 0000:07:00.0:   buf[01]: ffffffff
> 2006-10-17 21:31:47 ib_mthca 0000:07:00.0:   buf[02]: ffffffff
> 2006-10-17 21:31:47 ib_mthca 0000:07:00.0:   buf[03]: ffffffff
> 2006-10-17 21:31:47 ib_mthca 0000:07:00.0:   buf[04]: ffffffff
> 2006-10-17 21:31:47 ib_mthca 0000:07:00.0:   buf[05]: ffffffff
> 2006-10-17 21:31:47 ib_mthca 0000:07:00.0:   buf[06]: ffffffff
> 2006-10-17 21:31:47 ib_mthca 0000:07:00.0:   buf[07]: ffffffff
> 2006-10-17 21:31:47 ib_mthca 0000:07:00.0:   buf[08]: ffffffff
> 2006-10-17 21:31:47 ib_mthca 0000:07:00.0:   buf[09]: ffffffff
> 2006-10-17 21:31:47 ib_mthca 0000:07:00.0:   buf[0a]: ffffffff
> 2006-10-17 21:31:47 ib_mthca 0000:07:00.0:   buf[0b]: ffffffff
> 2006-10-17 21:31:47 ib_mthca 0000:07:00.0:   buf[0c]: ffffffff
> 2006-10-17 21:31:47 ib_mthca 0000:07:00.0:   buf[0d]: ffffffff
> 2006-10-17 21:31:47 ib_mthca 0000:07:00.0:   buf[0e]: ffffffff
> 2006-10-17 21:31:47 ib_mthca 0000:07:00.0:   buf[0f]: ffffffff
>
> # rhea277 /root > /sbin/lspci -vv -s 07:00.0
> 07:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (rev 20)
>         Subsystem: Mellanox Technologies MT25208 InfiniHost III Ex
>         Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
>         Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
>         Interrupt: pin A routed to IRQ 217
>         Region 0: Memory at dff00000 (64-bit, non-prefetchable) [disabled] [size=1M]
>         Region 2: Memory at de800000 (64-bit, prefetchable) [disabled] [size=8M]
>         Capabilities: [40] Power Management version 2
>                 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
>                 Status: D0 PME-Enable- DSel=0 DScale=0 PME-
>         Capabilities: [48] Vital Product Data
>         Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable-
>                 Address: 0000000000000000  Data: 0000
>         Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
>                 Vector table: BAR=0 offset=00082000
>                 PBA: BAR=0 offset=00082200
>         Capabilities: [60] Express Endpoint IRQ 0
>                 Device: Supported: MaxPayload 128 bytes, PhantFunc 0, ExtTag-
>                 Device: Latency L0s <64ns, L1 unlimited
>                 Device: AtnBtn- AtnInd- PwrInd-
>                 Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
>                 Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
>                 Device: MaxPayload 128 bytes, MaxReadReq 512 bytes
>                 Link: Supported Speed 2.5Gb/s, Width x8, ASPM L0s, Port 8
>                 Link: Latency L0s unlimited, L1 unlimited
>                 Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
>                 Link: Speed 2.5Gb/s, Width x8
>   

Can you please give me some information on how you got this error:
* What did you do that caused this error?
* Which FW version do you have?
* What is the board_id of the HCA? (You can find this using ibv_devinfo.)
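
If it is easier to collect these from several nodes at once, here is a minimal sketch that pulls the firmware version and board_id out of ibv_devinfo output. It assumes the usual "key: value" layout that ibv_devinfo prints, with an hca_id line starting each device. The helper names (parse_devinfo, query_devinfo) and the SAMPLE text are hypothetical and made up for illustration; the values in SAMPLE are placeholders, not data from the machine in this report.

```python
import re
import subprocess

# Fields we want to report per HCA (names as printed by ibv_devinfo).
FIELDS = ("hca_id", "fw_ver", "board_id")


def parse_devinfo(text):
    """Return one dict per HCA holding the fields of interest found in text."""
    devices = []
    for line in text.splitlines():
        # Match "key: value" with optional leading indentation.
        m = re.match(r"\s*(\w+):\s*(\S.*?)\s*$", line)
        if not m:
            continue
        key, value = m.groups()
        if key == "hca_id":
            # Assumption: an hca_id line starts the block for a new device.
            devices.append({"hca_id": value})
        elif key in FIELDS and devices:
            devices[-1][key] = value
    return devices


def query_devinfo():
    """Run ibv_devinfo on this host and parse its output (needs the tool installed)."""
    out = subprocess.run(
        ["ibv_devinfo"], capture_output=True, text=True, check=True
    ).stdout
    return parse_devinfo(out)


# Placeholder text for trying the parser without an HCA present.
SAMPLE = """\
hca_id: mthca0
        fw_ver:                         0.0.0
        board_id:                       EXAMPLE_BOARD_ID
"""

if __name__ == "__main__":
    for dev in parse_devinfo(SAMPLE):
        print(dev)
```

On a node with the HCA, calling query_devinfo() instead of parsing SAMPLE should give the same kind of list, one entry per device.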

Thanks,
Dotan



