[ofa-general] Catastrophic error on an mthca driver
Ramiro Alba Queipo
raq at cttc.upc.edu
Wed Oct 1 04:21:20 PDT 2008
Hi all,
I recently had a problem with the InfiniBand card in the server of our
cluster, which in turn brought the whole fabric down because the opensm
daemon ran into problems. dmesg showed:
--------------------------------------------------------------------
[408188.411258] ib_mthca 0000:0c:00.0: Catastrophic error detected: internal error
[408188.411266] ib_mthca 0000:0c:00.0: buf[00]: 000d0000
[408188.411269] ib_mthca 0000:0c:00.0: buf[01]: 00000000
[408188.411271] ib_mthca 0000:0c:00.0: buf[02]: 00000000
[408188.411274] ib_mthca 0000:0c:00.0: buf[03]: 00000000
[408188.411276] ib_mthca 0000:0c:00.0: buf[04]: 00000000
[408188.411279] ib_mthca 0000:0c:00.0: buf[05]: 00127e9c
[408188.411281] ib_mthca 0000:0c:00.0: buf[06]: ffffffff
[408188.411283] ib_mthca 0000:0c:00.0: buf[07]: 00000000
[408188.411286] ib_mthca 0000:0c:00.0: buf[08]: 00000000
[408188.411288] ib_mthca 0000:0c:00.0: buf[09]: 00000000
[408188.411290] ib_mthca 0000:0c:00.0: buf[0a]: 00000000
[408188.411292] ib_mthca 0000:0c:00.0: buf[0b]: 00000000
[408188.411295] ib_mthca 0000:0c:00.0: buf[0c]: 00000000
[408188.411297] ib_mthca 0000:0c:00.0: buf[0d]: 00000000
[408188.411299] ib_mthca 0000:0c:00.0: buf[0e]: 00000000
[408188.411302] ib_mthca 0000:0c:00.0: buf[0f]: 00000000
------------------------------------------------------------
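For anyone comparing several such dumps, here is a minimal Python sketch that pulls the buf[] words out of dmesg text and lists the non-zero ones. The function names are hypothetical, the regular expression only assumes the line layout shown above, and the sketch does not try to interpret what the words mean.

```python
import re

# Matches lines such as:
#   [408188.411266] ib_mthca 0000:0c:00.0: buf[00]: 000d0000
# (layout assumed from the dmesg excerpt in this message)
BUF_RE = re.compile(r"ib_mthca\s+(\S+):\s+buf\[([0-9a-f]{2})\]:\s+([0-9a-f]{8})")


def parse_catas_buf(dmesg_text):
    """Return {pci_device: {index: word}} for every buf[] line found."""
    dumps = {}
    for line in dmesg_text.splitlines():
        m = BUF_RE.search(line)
        if m:
            dev = m.group(1)
            idx = int(m.group(2), 16)
            word = int(m.group(3), 16)
            dumps.setdefault(dev, {})[idx] = word
    return dumps


def nonzero_words(dump):
    """Keep only the words that are not zero, ordered by index."""
    return {idx: word for idx, word in sorted(dump.items()) if word != 0}


# A few lines copied from the dump above.
SAMPLE = """\
[408188.411258] ib_mthca 0000:0c:00.0: Catastrophic error detected: internal error
[408188.411266] ib_mthca 0000:0c:00.0: buf[00]: 000d0000
[408188.411269] ib_mthca 0000:0c:00.0: buf[01]: 00000000
[408188.411279] ib_mthca 0000:0c:00.0: buf[05]: 00127e9c
[408188.411281] ib_mthca 0000:0c:00.0: buf[06]: ffffffff
"""

if __name__ == "__main__":
    for dev, dump in parse_catas_buf(SAMPLE).items():
        for idx, word in nonzero_words(dump).items():
            print(f"{dev} buf[{idx:02x}] = {word:08x}")
```

Feeding it the full output of dmesg instead of SAMPLE should work the same way, as long as the lines keep this layout.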
The problem went away once I restarted networking, that is:
/etc/init.d/networking restart => ifdown -a and then ifup -a
I suspect this was triggered by running 'smpquery', but I do not know
whether that makes sense.
Anyway, 'dmesg' now shows the following messages:
---------------------------------------------------------
[417317.088898] ib_mad: Method 1 already in use
[431433.665919] ib_mad: Method 1 already in use
[431533.719671] ib_mad: Method 1 already in use
[438159.301272] ib_mad: Method 1 already in use
[438236.583426] ib_mad: Method 1 already in use
---------------------------------------------------------
I rebooted the server and did a firmware update, although it did not
seem to be necessary:
flint -d /dev/mst/mt25204_pci_cr0 -i jff202/fw-25204-1_2_0-MHGS18-XTC_A5.bin b
Current FW version on flash: 1.2.0
New FW version: 1.2.0
Note: The new FW version is not newer than the current FW version on flash.
Do you want to continue ? (y/n) [n] : y
Read and verify Invariant Sector - OK
Read and verify PPS/SPS on flash - OK
Burning second FW image without signatures - OK
Restoring second signature - OK
Then I did a verify:
root@jff:~# flint -d /dev/mst/mt25204_pci_cr0 v
Failsafe image:
Invariant /0x00000028-0x00000953 (0x00092c)/ (BOOT2) - OK
Primary Pointer Sector /0x00010000/ - invalid signature (00000000)
Secondary Image /0x00020000-0x00020107 (0x000108)/ (Pointer Sector)- OK
/0x00090028-0x0009086f (0x000848)/ (BOOT2) - OK
/0x00090870-0x000945ff (0x003d90)/ (BOOT2) - OK
/0x00094600-0x0009515f (0x000b60)/ (Configuration) - OK
/0x00095160-0x00095193 (0x000034)/ (GUID) - OK
/0x00095194-0x000951db (0x000048)/ (Image Info) - OK
/0x000951dc-0x0009525b (0x000080)/ (DDR) - OK
/0x0009525c-0x000a8e2f (0x013bd4)/ (DDR) - OK
/0x000a8e30-0x000a8eaf (0x000080)/ (DDR) - OK
/0x000a8eb0-0x000aaebb (0x00200c)/ (DDR) - OK
/0x000aaebc-0x000aaf3b (0x000080)/ (DDR) - OK
/0x000aaf3c-0x000e147b (0x036540)/ (DDR) - OK
/0x000e147c-0x000e148f (0x000014)/ (Configuration) - OK
/0x000e1490-0x000e14d3 (0x000044)/ (Jump addresses) - OK
/0x000e14d4-0x000e16db (0x000208)/ (FW Configuration) - OK
FW image verification succeeded. Image is bootable.
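Since the verify listing is long, a small Python sketch like the following can list only the sections whose status is something other than "OK". The regular expression is an assumption based on the lines printed above (with wrapped lines rejoined), and the function name is hypothetical. The sketch only reports what flint printed; it does not decide whether a given status is harmless for a failsafe image.

```python
import re

# Assumed layout, taken from the verify output above:
#   [label] /0xSTART[-0xEND] [(0xSIZE)]/ [(kind)] - status
SECTION_RE = re.compile(
    r"^(?P<label>[^/]*)/(?P<span>0x[0-9a-fA-F]+(?:-0x[0-9a-fA-F]+)?)"
    r"(?:\s+\(0x[0-9a-fA-F]+\))?/\s*"
    r"(?:\((?P<kind>[^)]*)\))?\s*-\s*(?P<status>.+)$"
)


def failed_sections(verify_text):
    """Return (name, address span, status) for every section not marked OK."""
    failures = []
    for line in verify_text.splitlines():
        m = SECTION_RE.match(line.strip())
        if m and m.group("status").strip() != "OK":
            # Prefer the leading label; fall back to the parenthesised kind.
            name = m.group("label").strip() or m.group("kind") or "?"
            failures.append((name, m.group("span"), m.group("status").strip()))
    return failures


# A few lines copied from the verify output above.
SAMPLE = """\
Failsafe image:
Invariant /0x00000028-0x00000953 (0x00092c)/ (BOOT2) - OK
Primary Pointer Sector /0x00010000/ - invalid signature (00000000)
Secondary Image /0x00020000-0x00020107 (0x000108)/ (Pointer Sector)- OK
/0x00090028-0x0009086f (0x000848)/ (BOOT2) - OK
FW image verification succeeded. Image is bootable.
"""

if __name__ == "__main__":
    for name, span, status in failed_sections(SAMPLE):
        print(f"{name} at {span}: {status}")
```

On the sample lines this reports only the Primary Pointer Sector entry.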
I have now noticed that both the card port and the switch port it is
connected to show non-zero 'XmtDiscards' (though the counts do not seem
to be increasing):
# Port counters: Lid 1 port 1
PortSelect:......................1
CounterSelect:...................0x0000
SymbolErrors:....................0
LinkRecovers:....................0
LinkDowned:......................0
RcvErrors:.......................0
RcvRemotePhysErrors:.............0
RcvSwRelayErrors:................0
XmtDiscards:.....................2
XmtConstraintErrors:.............0
RcvConstraintErrors:.............0
LinkIntegrityErrors:.............0
ExcBufOverrunErrors:.............0
VL15Dropped:.....................2
XmtData:.........................1043921
RcvData:.........................123107938
XmtPkts:.........................36932
RcvPkts:.........................249752
# Port counters: Lid 4 port 23
PortSelect:......................23
CounterSelect:...................0x0100
SymbolErrors:....................0
LinkRecovers:....................0
LinkDowned:......................0
RcvErrors:.......................0
RcvRemotePhysErrors:.............0
RcvSwRelayErrors:................142
XmtDiscards:.....................199
XmtConstraintErrors:.............0
RcvConstraintErrors:.............0
LinkIntegrityErrors:.............0
ExcBufOverrunErrors:.............0
VL15Dropped:.....................0
XmtData:.........................649631134
RcvData:.........................6429228
XmtPkts:.........................1345694
RcvPkts:.........................231549
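To check whether these counters really stay put, two dumps taken some time apart can be compared. The Python sketch below parses text in the layout shown above (which looks like perfquery output) and reports only the error counters that increased. The function names are hypothetical, the set of "error" counters is simply the ones listed above, and the second snapshot in the example is invented for illustration, not measured on this fabric.

```python
import re

HEADER_RE = re.compile(r"^# Port counters: Lid (\d+) port (\d+)")
COUNTER_RE = re.compile(r"^(\w+):\.*(\S+)$")

# Counters from the dumps above that indicate errors or drops.
ERROR_COUNTERS = {
    "SymbolErrors", "LinkRecovers", "LinkDowned", "RcvErrors",
    "RcvRemotePhysErrors", "RcvSwRelayErrors", "XmtDiscards",
    "XmtConstraintErrors", "RcvConstraintErrors", "LinkIntegrityErrors",
    "ExcBufOverrunErrors", "VL15Dropped",
}


def parse_counters(text):
    """Return {(lid, port): {counter_name: value}} from counter dump text."""
    ports, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        h = HEADER_RE.match(line)
        if h:
            current = ports.setdefault((int(h.group(1)), int(h.group(2))), {})
            continue
        c = COUNTER_RE.match(line)
        if c and current is not None:
            # Base 0 accepts both decimal values and 0x-prefixed ones.
            current[c.group(1)] = int(c.group(2), 0)
    return ports


def growth(before, after):
    """Return {(lid, port): {name: delta}} for error counters that increased."""
    grown = {}
    for port, counters in after.items():
        old = before.get(port, {})
        deltas = {
            name: value - old.get(name, 0)
            for name, value in counters.items()
            if name in ERROR_COUNTERS and value > old.get(name, 0)
        }
        if deltas:
            grown[port] = deltas
    return grown


# First snapshot: values copied from the switch port dump above.
FIRST = """\
# Port counters: Lid 4 port 23
PortSelect:......................23
CounterSelect:...................0x0100
RcvSwRelayErrors:................142
XmtDiscards:.....................199
"""

# Second snapshot: HYPOTHETICAL values, made up to show a non-empty result.
SECOND = """\
# Port counters: Lid 4 port 23
PortSelect:......................23
CounterSelect:...................0x0100
RcvSwRelayErrors:................142
XmtDiscards:.....................204
"""

if __name__ == "__main__":
    print(growth(parse_counters(FIRST), parse_counters(SECOND)))
```

An empty result between two real snapshots would support the impression that the discards are left over from the earlier incident rather than still accumulating.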
Is this a hardware problem? Is there a way to check for a hardware
problem?
Regards