[ofa-general] ib_mthca catastrophic error detected
Scott A. Friedman
friedman at ucla.edu
Mon Oct 27 18:01:17 PDT 2008
Hello
On a several-hundred-node cluster we run here, we have had several
large (512+ core) jobs die with the following left in the logs of
several nodes. Below are examples from two different nodes. 22 nodes
had this error after the large run died.
What is this error, and why would we be seeing it? I looked through this
list and only came across a couple of mentions, but no real explanation.
node example A:
ib_mthca 0000:02:00.0: Catastrophic error detected: internal error
ib_mthca 0000:02:00.0: buf[00]: 0012f6f8
ib_mthca 0000:02:00.0: buf[01]: 00000000
ib_mthca 0000:02:00.0: buf[02]: 00000000
ib_mthca 0000:02:00.0: buf[03]: 00000000
ib_mthca 0000:02:00.0: buf[04]: 00000000
ib_mthca 0000:02:00.0: buf[05]: 0012f6dc
ib_mthca 0000:02:00.0: buf[06]: 001b3714
ib_mthca 0000:02:00.0: buf[07]: 00000000
ib_mthca 0000:02:00.0: buf[08]: 00000000
ib_mthca 0000:02:00.0: buf[09]: 00000000
ib_mthca 0000:02:00.0: buf[0a]: 00000000
ib_mthca 0000:02:00.0: buf[0b]: 00000000
ib_mthca 0000:02:00.0: buf[0c]: 00000000
ib_mthca 0000:02:00.0: buf[0d]: 00000000
ib_mthca 0000:02:00.0: buf[0e]: 00000000
ib_mthca 0000:02:00.0: buf[0f]: 00000000
node example B:
ib_mthca 0000:02:00.0: Catastrophic error detected: internal error
ib_mthca 0000:02:00.0: buf[00]: 0012bb7c
ib_mthca 0000:02:00.0: buf[01]: 00000000
ib_mthca 0000:02:00.0: buf[02]: 00000000
ib_mthca 0000:02:00.0: buf[03]: 00000000
ib_mthca 0000:02:00.0: buf[04]: 00000000
ib_mthca 0000:02:00.0: buf[05]: 0012bb5c
ib_mthca 0000:02:00.0: buf[06]: 001905a0
ib_mthca 0000:02:00.0: buf[07]: 00000000
ib_mthca 0000:02:00.0: buf[08]: 00000000
ib_mthca 0000:02:00.0: buf[09]: 00000000
ib_mthca 0000:02:00.0: buf[0a]: 00000000
ib_mthca 0000:02:00.0: buf[0b]: 00000000
ib_mthca 0000:02:00.0: buf[0c]: 00000000
ib_mthca 0000:02:00.0: buf[0d]: 00000000
ib_mthca 0000:02:00.0: buf[0e]: 00000000
ib_mthca 0000:02:00.0: buf[0f]: 00000000
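
In both dumps above only buf[00], buf[05] and buf[06] are non-zero. In
case it helps when comparing the dumps from all 22 nodes, here is a
minimal sketch that pulls the non-zero words out of one node's log text.
The function name nonzero_words and the regular expression are my own
assumptions based on the line format shown above; it does not try to
interpret the values, and it should be run on one node's log at a time.

```python
import re

# Matches lines such as "ib_mthca 0000:02:00.0: buf[05]: 0012f6dc".
# The pattern is an assumption derived from the log lines in this post.
BUF_LINE = re.compile(
    r"ib_mthca (?P<dev>\S+): buf\[(?P<idx>[0-9a-f]{2})\]: (?P<val>[0-9a-f]{8})"
)


def nonzero_words(log_text):
    """Return {pci_device: {buf_index: value}} for the non-zero buf[] words."""
    result = {}
    for match in BUF_LINE.finditer(log_text):
        value = int(match.group("val"), 16)
        if value:  # skip the all-zero words
            index = int(match.group("idx"), 16)
            result.setdefault(match.group("dev"), {})[index] = value
    return result


if __name__ == "__main__":
    # A few lines copied from node example A above.
    sample = """\
ib_mthca 0000:02:00.0: Catastrophic error detected: internal error
ib_mthca 0000:02:00.0: buf[00]: 0012f6f8
ib_mthca 0000:02:00.0: buf[01]: 00000000
ib_mthca 0000:02:00.0: buf[05]: 0012f6dc
ib_mthca 0000:02:00.0: buf[06]: 001b3714
"""
    for dev, words in nonzero_words(sample).items():
        for index, value in sorted(words.items()):
            print(f"{dev} buf[{index:02x}] = {value:08x}")
```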