[ofa-general] ib_mthca catastrophic error detected
    Scott A. Friedman 
    friedman at ucla.edu
       
    Mon Oct 27 18:01:17 PDT 2008
    
    
  
Hello,
On a cluster of several hundred nodes that we run here, several large 
(512+ core) jobs have died, leaving the following in the logs of a number 
of the nodes. Below are examples from two different nodes. 22 nodes 
had this error after the large run died.
What is this error, and why would we be seeing it? I searched this 
list and found only a couple of mentions, but no real explanation.
node example A:
ib_mthca 0000:02:00.0: Catastrophic error detected: internal error
ib_mthca 0000:02:00.0:   buf[00]: 0012f6f8
ib_mthca 0000:02:00.0:   buf[01]: 00000000
ib_mthca 0000:02:00.0:   buf[02]: 00000000
ib_mthca 0000:02:00.0:   buf[03]: 00000000
ib_mthca 0000:02:00.0:   buf[04]: 00000000
ib_mthca 0000:02:00.0:   buf[05]: 0012f6dc
ib_mthca 0000:02:00.0:   buf[06]: 001b3714
ib_mthca 0000:02:00.0:   buf[07]: 00000000
ib_mthca 0000:02:00.0:   buf[08]: 00000000
ib_mthca 0000:02:00.0:   buf[09]: 00000000
ib_mthca 0000:02:00.0:   buf[0a]: 00000000
ib_mthca 0000:02:00.0:   buf[0b]: 00000000
ib_mthca 0000:02:00.0:   buf[0c]: 00000000
ib_mthca 0000:02:00.0:   buf[0d]: 00000000
ib_mthca 0000:02:00.0:   buf[0e]: 00000000
ib_mthca 0000:02:00.0:   buf[0f]: 00000000
node example B:
ib_mthca 0000:02:00.0: Catastrophic error detected: internal error
ib_mthca 0000:02:00.0:   buf[00]: 0012bb7c
ib_mthca 0000:02:00.0:   buf[01]: 00000000
ib_mthca 0000:02:00.0:   buf[02]: 00000000
ib_mthca 0000:02:00.0:   buf[03]: 00000000
ib_mthca 0000:02:00.0:   buf[04]: 00000000
ib_mthca 0000:02:00.0:   buf[05]: 0012bb5c
ib_mthca 0000:02:00.0:   buf[06]: 001905a0
ib_mthca 0000:02:00.0:   buf[07]: 00000000
ib_mthca 0000:02:00.0:   buf[08]: 00000000
ib_mthca 0000:02:00.0:   buf[09]: 00000000
ib_mthca 0000:02:00.0:   buf[0a]: 00000000
ib_mthca 0000:02:00.0:   buf[0b]: 00000000
ib_mthca 0000:02:00.0:   buf[0c]: 00000000
ib_mthca 0000:02:00.0:   buf[0d]: 00000000
ib_mthca 0000:02:00.0:   buf[0e]: 00000000
ib_mthca 0000:02:00.0:   buf[0f]: 00000000
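
For comparing the dumps across all 22 affected nodes, something like the following could reduce each log to just its non-zero buf words. This is only a sketch: the function name `nonzero_words` and the regular expression are illustrative assumptions based on the line format shown above, not part of ib_mthca or any OFED tool, and the script makes no attempt to interpret what the words mean.

```python
import re
from collections import defaultdict

# Assumed line format, taken from the log excerpts above, e.g.
#   ib_mthca 0000:02:00.0:   buf[05]: 0012f6dc
BUF_RE = re.compile(
    r"ib_mthca\s+(\S+):\s+buf\[([0-9a-f]{2})\]:\s+([0-9a-f]{8})"
)


def nonzero_words(log_text):
    """Return {pci_device: {buf_index: value}} for every non-zero buf word.

    If a log holds more than one dump for the same device, later
    values overwrite earlier ones at the same index.
    """
    result = defaultdict(dict)
    for line in log_text.splitlines():
        match = BUF_RE.search(line)
        if match is None:
            continue  # not a buf[] line (e.g. the "Catastrophic error" header)
        device = match.group(1)
        index = int(match.group(2), 16)  # buf index is printed in hex
        value = int(match.group(3), 16)
        if value != 0:
            result[device][index] = value
    return dict(result)


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py < /path/to/node/log
    for device, words in nonzero_words(sys.stdin.read()).items():
        for index, value in sorted(words.items()):
            print(f"{device} buf[{index:02x}] = {value:08x}")
```

Run against the two excerpts above, this would leave only buf[00], buf[05] and buf[06] for each node, which are the only non-zero words in both dumps.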
    
    