[ofa-general] ib_mthca catastrophic error detected

Tziporet Koren tziporet at dev.mellanox.co.il
Tue Oct 28 05:27:58 PDT 2008


Scott A. Friedman wrote:
> Hello
>
> On a several-hundred-node cluster we run here, we have had 
> several large (512+ core) jobs die with the following left in several 
> of the nodes' logs. Below is an example from two different nodes. 22 
> nodes had this error after the large run died.
>
> What is this error, and why would we be seeing it? I looked through this 
> list and only came across a couple of mentions but no real explanation.
>
> node example A:
>
> ib_mthca 0000:02:00.0: Catastrophic error detected: internal error
>
Can you specify:
Which OFED version are you using (or IB from kernel.org)?
Which HCA and FW version do you have?

Tziporet





