[ofa-general] ib_mthca catastrophic error detected

John Valdes valdes at anl.gov
Wed May 27 13:07:11 PDT 2009


All,

I posted last week about SRP problems we've been having since
moving from some old servers running RHEL 5.1 to new servers
running RHEL 5.3.  We're still trying to isolate the cause of the
problems, but one of the symptoms we're seeing is that occasionally
when under stress (well, if you can call doing a "dd" from /dev/zero
to the SRP target stress...), the ib_mthca driver will report a
"Catastrophic error":

  ib_mthca 0000:04:00.0: Catastrophic error detected: internal error
   host8: ib_srp: failed receive status 4
  ib_srp:  host8: add qp_in_err timer
   host8: ib_srp: failed receive status 5
  ib_mthca 0000:04:00.0:   buf[00]: 00000000
  ib_mthca 0000:04:00.0:   buf[01]: 00000000
  ib_mthca 0000:04:00.0:   buf[02]: 00000000
  ib_mthca 0000:04:00.0:   buf[03]: 00000000
  ib_mthca 0000:04:00.0:   buf[04]: 00000000
  ib_mthca 0000:04:00.0:   buf[05]: 00000000
  ib_mthca 0000:04:00.0:   buf[06]: 00000000
  ib_mthca 0000:04:00.0:   buf[07]: 00000000
  ib_mthca 0000:04:00.0:   buf[08]: 00000000
  ib_mthca 0000:04:00.0:   buf[09]: 00000000
  ib_mthca 0000:04:00.0:   buf[0a]: 00000000
  ib_mthca 0000:04:00.0:   buf[0b]: 00000000
  ib_mthca 0000:04:00.0:   buf[0c]: 00000000
  ib_mthca 0000:04:00.0:   buf[0d]: 00000000
  ib_mthca 0000:04:00.0:   buf[0e]: 00000000
  ib_mthca 0000:04:00.0:   buf[0f]: 00000000
   host8: ib_srp: srp_qp_in_err_timer called
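
For anyone who wants to check their own logs for the same pattern, here's
a rough Python sketch that pulls the catastrophic error reports and the
buf[] dump words out of kernel log text.  The regular expressions only
assume the message format shown in the excerpt above; the function name
summarize_catas is made up for illustration and isn't part of any OFED
tool.

```python
import re

# Both patterns assume the ib_mthca message layout seen in the excerpt above.
CATAS_RE = re.compile(r"ib_mthca (\S+): Catastrophic error detected: (.+)")
BUF_RE = re.compile(r"ib_mthca (\S+):\s+buf\[([0-9a-f]{2})\]: ([0-9a-f]{8})")


def summarize_catas(lines):
    """Group catastrophic error reports and their buf[] words by PCI device.

    Hypothetical helper: returns {pci_addr: {"type": str, "buf": {index: word}}}.
    A later report for the same device replaces the earlier one.
    """
    events = {}
    for line in lines:
        m = CATAS_RE.search(line)
        if m:
            events[m.group(1)] = {"type": m.group(2).strip(), "buf": {}}
            continue
        m = BUF_RE.search(line)
        if m and m.group(1) in events:
            index = int(m.group(2), 16)
            events[m.group(1)]["buf"][index] = int(m.group(3), 16)
    return events


def nonzero_words(event):
    """Return the buf[] entries whose value is not zero."""
    return {i: w for i, w in event["buf"].items() if w != 0}
```

Feeding it the output of "dmesg" (split into lines) and calling
nonzero_words() on each entry shows at a glance whether the dump carried
any non-zero words, as opposed to the all-zero dump above.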

Checking back through the list archives, the consensus seems to be
that these are due to card problems, usually with the firmware.  We've
never had this problem w/ the old servers under RHEL 5.1 w/ the
bundled OFED 1.2, but maybe the new servers and/or RHEL 5.3 w/
OFED 1.3 are pushing the card harder and/or tickling a bug in the
firmware?  The cards are Cisco-branded Mellanox Cougar Cub cards;
"tvflash -i" identifies them as:

  HCA #0: MT23108, Cougar Cub, revision A1
    Primary image is v3.5.917 build 3.2.0.149, with label 'HCA.CougarCub.A1'
    Secondary image is v3.3.005 build 3.2.0.67, with label 'HCA.CougarCub.A1'

    Vital Product Data
      Product Name: Cougar cub
      P/N: SFS-HCA-X2T7-A1
      E/C: Rev: A0
      S/N: CS0636X00286
      Freq/Power: PW=12W;PCI 66MHZ;PCI-X 133MHZ
      Date Code: 0636
      Checksum: Ok
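
If you need to compare firmware across a number of hosts, a small Python
sketch like the following can pull the image versions out of saved
"tvflash -i" output.  It only assumes the "Primary/Secondary image is ..."
line format shown above; parse_tvflash_images is a made-up helper name,
not something tvflash provides.

```python
import re

# Assumes the image line format printed by "tvflash -i" in the output above.
IMAGE_RE = re.compile(
    r"(Primary|Secondary) image is v(\S+) build ([\d.]+), with label '([^']*)'"
)


def parse_tvflash_images(text):
    """Return {"primary": {...}, "secondary": {...}} for the images found in text."""
    images = {}
    for m in IMAGE_RE.finditer(text):
        images[m.group(1).lower()] = {
            "version": m.group(2),
            "build": m.group(3),
            "label": m.group(4),
        }
    return images
```

Collecting the "primary" version from each host makes it easy to spot a
card that is running a different image than the rest.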

Unfortunately, v3.5.917 seems to be the latest version of the firmware
listed on Cisco's website, at least as far as I could find.

Is anyone aware of any issues with this version of the firmware?

John

----------------------------------------------------------------------
John Valdes                  Mathematics and Computer Science Division
valdes at anl.gov                             Argonne National Laboratory


