[ofa-general] ib_mthca catastrophic error detected
Tziporet Koren
tziporet at dev.mellanox.co.il
Tue Oct 28 05:27:58 PDT 2008
Scott A. Friedman wrote:
> Hello
>
> On a several-hundred-node cluster we run here, we have had several
> large (512+ core) jobs die, leaving the following in the logs of
> several nodes. Below is an example from two different nodes. 22
> nodes had this error after the large run died.
>
> What is this error, and why would we be seeing it? I looked through this
> list and came across only a couple of mentions, but no real explanation.
>
> node example A:
>
> ib_mthca 0000:02:00.0: Catastrophic error detected: internal error
>
Can you specify:
Which OFED version are you using (or are you using the IB stack from kernel.org)?
Which HCA model and firmware (FW) version?
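
To tally which nodes and devices reported the error, something like the following may help. It is only a minimal sketch in Python: the pattern is modeled on the single log line quoted above, and other driver or kernel versions may format the message differently. The names MTHCA_ERR and find_catastrophic_errors are hypothetical and are not part of OFED or the mthca driver.

```python
import re

# Pattern modeled on the one log line quoted in this thread (assumption:
# other kernels or driver versions may word the message differently).
MTHCA_ERR = re.compile(
    r"ib_mthca (?P<pci>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]): "
    r"Catastrophic error detected: (?P<kind>.+?)\s*$"
)


def find_catastrophic_errors(lines):
    """Return a (pci_address, error_kind) tuple for each matching log line."""
    hits = []
    for line in lines:
        match = MTHCA_ERR.search(line)
        if match:
            hits.append((match.group("pci"), match.group("kind")))
    return hits


if __name__ == "__main__":
    sample = [
        "ib_mthca 0000:02:00.0: Catastrophic error detected: internal error",
    ]
    for pci, kind in find_catastrophic_errors(sample):
        print(pci, kind)
```

Feeding each node's kernel log through this function gives the PCI address and error type per hit, which is useful to include alongside the OFED, HCA, and firmware details.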
Tziporet