[ofa-general] ib_mthca catastrophic error detected
    Tziporet Koren 
    tziporet at dev.mellanox.co.il
       
    Tue Oct 28 05:27:58 PDT 2008
    
    
  
Scott A. Friedman wrote:
> Hello
>
> On a several-hundred-node cluster we run here, we have had several 
> large (512+ core) jobs die with the following left in the logs of 
> several nodes. Below is an example from two different nodes. 22 
> nodes had this error after the large run died.
>
> What is this error, and why would we be seeing it? I looked through this 
> list and only came across a couple of mentions but no real explanation.
>
> node example A:
>
> ib_mthca 0000:02:00.0: Catastrophic error detected: internal error
>
Can you specify:
Which OFED version are you using (or are you running the IB stack from kernel.org)?
Which HCA model and firmware (FW) version?
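
One possible way to collect these details on an affected node is sketched below. It is only a rough sketch under stated assumptions: that the driver exposes `hca_type`, `fw_ver` and `board_id` files under /sys/class/infiniband/<device>/, and that an OFED install provides the `ofed_info` command with its `-s` (short version) option. The function names `collect_hca_info` and `ofed_version` are hypothetical helpers made up for this example. Missing files or a missing command are skipped rather than treated as errors.

```python
import subprocess
from pathlib import Path


def collect_hca_info(sysfs_root="/sys/class/infiniband"):
    """Return {device: {attribute: value}} for each HCA under sysfs_root.

    Assumption: each device directory may contain hca_type, fw_ver and
    board_id; attributes that are absent are simply left out.
    """
    info = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return info  # no IB devices visible (or driver not loaded)
    for dev in sorted(root.iterdir()):
        attrs = {}
        for name in ("hca_type", "fw_ver", "board_id"):
            attr_file = dev / name
            if attr_file.is_file():
                attrs[name] = attr_file.read_text().strip()
        info[dev.name] = attrs
    return info


def ofed_version():
    """Return the output of `ofed_info -s`, or None if it cannot be run.

    Assumption: None suggests the kernel.org IB stack rather than OFED,
    but it may also just mean ofed_info is not on PATH.
    """
    try:
        result = subprocess.run(
            ["ofed_info", "-s"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    print("OFED:", ofed_version() or "ofed_info not available")
    for device, attrs in collect_hca_info().items():
        print(device, attrs)
```

Running this on each of the nodes that logged the error and comparing the results would show whether they share a firmware or stack version.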
Tziporet
    
    