[ofa-general] ib_mthca catastrophic error detected

Scott A. Friedman friedman at ucla.edu
Mon Oct 27 18:01:17 PDT 2008


Hello

On a cluster of several hundred nodes that we run here, we have had 
several large (512+ core) jobs die, leaving the following in the logs 
of several nodes. Below are examples from two different nodes. 22 nodes 
had this error after the large run died.

What is this error, and why would we be seeing it? I looked through this 
list and came across only a couple of mentions, but no real explanation.

node example A:

ib_mthca 0000:02:00.0: Catastrophic error detected: internal error
ib_mthca 0000:02:00.0:   buf[00]: 0012f6f8
ib_mthca 0000:02:00.0:   buf[01]: 00000000
ib_mthca 0000:02:00.0:   buf[02]: 00000000
ib_mthca 0000:02:00.0:   buf[03]: 00000000
ib_mthca 0000:02:00.0:   buf[04]: 00000000
ib_mthca 0000:02:00.0:   buf[05]: 0012f6dc
ib_mthca 0000:02:00.0:   buf[06]: 001b3714
ib_mthca 0000:02:00.0:   buf[07]: 00000000
ib_mthca 0000:02:00.0:   buf[08]: 00000000
ib_mthca 0000:02:00.0:   buf[09]: 00000000
ib_mthca 0000:02:00.0:   buf[0a]: 00000000
ib_mthca 0000:02:00.0:   buf[0b]: 00000000
ib_mthca 0000:02:00.0:   buf[0c]: 00000000
ib_mthca 0000:02:00.0:   buf[0d]: 00000000
ib_mthca 0000:02:00.0:   buf[0e]: 00000000
ib_mthca 0000:02:00.0:   buf[0f]: 00000000

node example B:

ib_mthca 0000:02:00.0: Catastrophic error detected: internal error
ib_mthca 0000:02:00.0:   buf[00]: 0012bb7c
ib_mthca 0000:02:00.0:   buf[01]: 00000000
ib_mthca 0000:02:00.0:   buf[02]: 00000000
ib_mthca 0000:02:00.0:   buf[03]: 00000000
ib_mthca 0000:02:00.0:   buf[04]: 00000000
ib_mthca 0000:02:00.0:   buf[05]: 0012bb5c
ib_mthca 0000:02:00.0:   buf[06]: 001905a0
ib_mthca 0000:02:00.0:   buf[07]: 00000000
ib_mthca 0000:02:00.0:   buf[08]: 00000000
ib_mthca 0000:02:00.0:   buf[09]: 00000000
ib_mthca 0000:02:00.0:   buf[0a]: 00000000
ib_mthca 0000:02:00.0:   buf[0b]: 00000000
ib_mthca 0000:02:00.0:   buf[0c]: 00000000
ib_mthca 0000:02:00.0:   buf[0d]: 00000000
ib_mthca 0000:02:00.0:   buf[0e]: 00000000
ib_mthca 0000:02:00.0:   buf[0f]: 00000000
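
To compare the dumps across all of the affected nodes, something like the
following could pull out just the non-zero buf words from each log. This is
only a minimal sketch: it is not part of ib_mthca or any OFED tool, the
function name nonzero_buf_words and the regular expression are made up for
illustration, and it assumes the log lines look exactly like the ones above.
It does not attempt to interpret what the values mean.

```python
import re
from collections import defaultdict

# Assumed line format, taken from the dumps above:
#   ib_mthca 0000:02:00.0:   buf[05]: 0012f6dc
BUF_RE = re.compile(r"ib_mthca\s+(\S+):\s+buf\[([0-9a-f]{2})\]:\s+([0-9a-f]{8})")


def nonzero_buf_words(log_text):
    """Return {pci_device: {buf_index: value}} for every non-zero buf word.

    Hypothetical helper. If a log holds more than one dump for the same
    device, later values overwrite earlier ones.
    """
    result = defaultdict(dict)
    for line in log_text.splitlines():
        m = BUF_RE.search(line)
        if m is None:
            continue  # not a buf[..] line (e.g. the "Catastrophic error" header)
        device = m.group(1)
        index = int(m.group(2), 16)
        value = int(m.group(3), 16)
        if value != 0:
            result[device][index] = value
    return dict(result)


# A few lines copied from node example A.
sample = """\
ib_mthca 0000:02:00.0: Catastrophic error detected: internal error
ib_mthca 0000:02:00.0:   buf[00]: 0012f6f8
ib_mthca 0000:02:00.0:   buf[01]: 00000000
ib_mthca 0000:02:00.0:   buf[05]: 0012f6dc
ib_mthca 0000:02:00.0:   buf[06]: 001b3714
"""

for device, words in nonzero_buf_words(sample).items():
    for index, value in sorted(words.items()):
        print(f"{device} buf[{index:02x}] = {value:08x}")
```

Run over each node's log, this would make it easier to see that in both
examples only buf[00], buf[05] and buf[06] are non-zero, and whether the
other nodes follow the same pattern.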


