[ofa-general] ib_mthca catastrophic error detected
John Valdes
valdes at anl.gov
Wed May 27 13:07:11 PDT 2009
All,
I posted last week about SRP problems we've been having since
upgrading from some old servers running RHEL 5.1 to new servers
running RHEL 5.3. We're still trying to isolate the cause of the
problems, but one of the symptoms we're seeing is that occasionally,
when under stress (well, if you can call doing a "dd" from /dev/zero
to the SRP target stress...), the ib_mthca driver reports a
"Catastrophic error":
ib_mthca 0000:04:00.0: Catastrophic error detected: internal error
host8: ib_srp: failed receive status 4
ib_srp: host8: add qp_in_err timer
host8: ib_srp: failed receive status 5
ib_mthca 0000:04:00.0: buf[00]: 00000000
ib_mthca 0000:04:00.0: buf[01]: 00000000
ib_mthca 0000:04:00.0: buf[02]: 00000000
ib_mthca 0000:04:00.0: buf[03]: 00000000
ib_mthca 0000:04:00.0: buf[04]: 00000000
ib_mthca 0000:04:00.0: buf[05]: 00000000
ib_mthca 0000:04:00.0: buf[06]: 00000000
ib_mthca 0000:04:00.0: buf[07]: 00000000
ib_mthca 0000:04:00.0: buf[08]: 00000000
ib_mthca 0000:04:00.0: buf[09]: 00000000
ib_mthca 0000:04:00.0: buf[0a]: 00000000
ib_mthca 0000:04:00.0: buf[0b]: 00000000
ib_mthca 0000:04:00.0: buf[0c]: 00000000
ib_mthca 0000:04:00.0: buf[0d]: 00000000
ib_mthca 0000:04:00.0: buf[0e]: 00000000
ib_mthca 0000:04:00.0: buf[0f]: 00000000
host8: ib_srp: srp_qp_in_err_timer called
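When the same messages need to be collected from several hosts, a minimal Python sketch along these lines could pull the error type and the dumped buffer words out of a saved kernel log. The function name `summarize_catas` and both regular expressions are illustrative assumptions derived only from the log lines shown above; they are not part of the driver or of any OFED tool, and other driver versions may format these messages differently.

```python
import re

# Assumed formats, taken from the ib_mthca lines in the log excerpt above.
CATAS_RE = re.compile(r"ib_mthca (\S+): Catastrophic error detected: (.+)")
BUF_RE = re.compile(r"ib_mthca (\S+): buf\[([0-9a-f]{2})\]: ([0-9a-f]{8})")


def summarize_catas(log_text):
    """Group catastrophic-error reports by PCI device.

    Returns {pci_address: {"error": str or None, "buf": {index: word}}}.
    """
    result = {}
    for line in log_text.splitlines():
        m = CATAS_RE.search(line)
        if m:
            dev = result.setdefault(m.group(1), {"error": None, "buf": {}})
            dev["error"] = m.group(2).strip()
            continue
        m = BUF_RE.search(line)
        if m:
            dev = result.setdefault(m.group(1), {"error": None, "buf": {}})
            # Both the index and the value are printed in hexadecimal.
            dev["buf"][int(m.group(2), 16)] = int(m.group(3), 16)
    return result


def buffer_is_all_zero(report):
    """True if a buffer was dumped and every word in it is zero."""
    return bool(report["buf"]) and all(v == 0 for v in report["buf"].values())
```

Calling `summarize_catas()` on the contents of a saved log and then `buffer_is_all_zero()` on each entry makes it quick to see whether any host dumped a non-zero buffer word, as opposed to the all-zero dump shown here.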
Checking back through the list archives, the consensus seems to be
that these errors are due to card problems, usually with the firmware.
We never had this problem w/ the old servers under RHEL 5.1 w/ the
bundled OFED 1.2, but maybe the new servers and/or RHEL 5.3 w/
OFED 1.3 are pushing the card harder and/or tickling a bug in the
firmware? The cards are Cisco-branded Mellanox Cougar Cub cards;
"tvflash -i" identifies them as:
HCA #0: MT23108, Cougar Cub, revision A1
Primary image is v3.5.917 build 3.2.0.149, with label 'HCA.CougarCub.A1'
Secondary image is v3.3.005 build 3.2.0.67, with label 'HCA.CougarCub.A1'
Vital Product Data
Product Name: Cougar cub
P/N: SFS-HCA-X2T7-A1
E/C: Rev: A0
S/N: CS0636X00286
Freq/Power: PW=12W;PCI 66MHZ;PCI-X 133MHZ
Date Code: 0636
Checksum: Ok
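For checking that every host carries the same firmware images, a minimal Python sketch like the following could extract the versions from saved "tvflash -i" output. The function name `parse_images` and the regular expression are illustrative assumptions based only on the two image lines shown above; other tvflash releases may print a different format.

```python
import re

# Assumed format, taken from the "Primary image is ..." and
# "Secondary image is ..." lines in the tvflash output above.
IMAGE_RE = re.compile(
    r"(Primary|Secondary) image is v([\d.]+) build ([\d.]+), "
    r"with label '([^']*)'"
)


def parse_images(tvflash_text):
    """Return {"primary": {...}, "secondary": {...}} for the images found."""
    images = {}
    for m in IMAGE_RE.finditer(tvflash_text):
        slot, version, build, label = m.groups()
        images[slot.lower()] = {
            # Turn the version into a tuple of ints so versions compare
            # numerically, e.g. "3.5.917" becomes (3, 5, 917).
            "version": tuple(int(part) for part in version.split(".")),
            "build": build,
            "label": label,
        }
    return images
```

Comparing the "version" tuples from different hosts, or the primary against the secondary image on one host, then shows at a glance which cards are running older firmware.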
Unfortunately, v3.5.917 seems to be the latest version of the firmware
listed on Cisco's website, at least as far as I could find.
Is anyone aware of any issues with this version of the firmware?
John
----------------------------------------------------------------------
John Valdes Mathematics and Computer Science Division
valdes at anl.gov Argonne National Laboratory