[ofa-general] Catastrophic error on an mthca driver

Ramiro Alba Queipo raq at cttc.upc.edu
Wed Oct 1 04:21:20 PDT 2008


Hi all,

I recently had a problem with the card in the server of an InfiniBand
cluster, which in turn brought the whole fabric down because the opensm
daemon ran into problems. The output of dmesg showed:

--------------------------------------------------------------------
[408188.411258] ib_mthca 0000:0c:00.0: Catastrophic error detected: internal error
[408188.411266] ib_mthca 0000:0c:00.0:   buf[00]: 000d0000
[408188.411269] ib_mthca 0000:0c:00.0:   buf[01]: 00000000
[408188.411271] ib_mthca 0000:0c:00.0:   buf[02]: 00000000
[408188.411274] ib_mthca 0000:0c:00.0:   buf[03]: 00000000
[408188.411276] ib_mthca 0000:0c:00.0:   buf[04]: 00000000
[408188.411279] ib_mthca 0000:0c:00.0:   buf[05]: 00127e9c
[408188.411281] ib_mthca 0000:0c:00.0:   buf[06]: ffffffff
[408188.411283] ib_mthca 0000:0c:00.0:   buf[07]: 00000000
[408188.411286] ib_mthca 0000:0c:00.0:   buf[08]: 00000000
[408188.411288] ib_mthca 0000:0c:00.0:   buf[09]: 00000000
[408188.411290] ib_mthca 0000:0c:00.0:   buf[0a]: 00000000
[408188.411292] ib_mthca 0000:0c:00.0:   buf[0b]: 00000000
[408188.411295] ib_mthca 0000:0c:00.0:   buf[0c]: 00000000
[408188.411297] ib_mthca 0000:0c:00.0:   buf[0d]: 00000000
[408188.411299] ib_mthca 0000:0c:00.0:   buf[0e]: 00000000
[408188.411302] ib_mthca 0000:0c:00.0:   buf[0f]: 00000000
------------------------------------------------------------
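
In case it helps anyone reading the dump, here is a minimal Python sketch
that picks the nonzero words out of the buf[] lines. It is not part of any
driver or OFED tool; the function name and the regular expression are my
own, and it only filters the dmesg text without interpreting what the words
mean:

```python
import re

# Matches lines such as:
# [408188.411266] ib_mthca 0000:0c:00.0:   buf[00]: 000d0000
BUF_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+ib_mthca\s+(?P<dev>\S+):\s+"
    r"buf\[(?P<idx>[0-9a-f]{2})\]:\s+(?P<val>[0-9a-f]{8})"
)


def nonzero_buf_words(dmesg_text):
    """Return {(device, index): value} for every nonzero buf[] word."""
    found = {}
    for line in dmesg_text.splitlines():
        m = BUF_RE.search(line)
        if m is None:
            continue  # not a buf[] line from ib_mthca
        value = int(m.group("val"), 16)
        if value != 0:
            found[(m.group("dev"), int(m.group("idx"), 16))] = value
    return found


# A few lines copied from the dump above.
sample = """\
[408188.411266] ib_mthca 0000:0c:00.0:   buf[00]: 000d0000
[408188.411269] ib_mthca 0000:0c:00.0:   buf[01]: 00000000
[408188.411279] ib_mthca 0000:0c:00.0:   buf[05]: 00127e9c
[408188.411281] ib_mthca 0000:0c:00.0:   buf[06]: ffffffff
"""

result = nonzero_buf_words(sample)
for (dev, idx), value in sorted(result.items()):
    print(f"{dev} buf[{idx:02x}] = 0x{value:08x}")
```

On the full dump this leaves only buf[00], buf[05] and buf[06], which are
the words worth quoting when reporting the error.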
The problem went away once I restarted networking, that is:

/etc/init.d/networking restart => ifdown -a and then ifup -a

My guess is that this was triggered by running 'smpquery', but I do not
know whether that makes sense.
In any case, dmesg now shows the following messages:

---------------------------------------------------------
[417317.088898] ib_mad: Method 1 already in use
[431433.665919] ib_mad: Method 1 already in use
[431533.719671] ib_mad: Method 1 already in use
[438159.301272] ib_mad: Method 1 already in use
[438236.583426] ib_mad: Method 1 already in use
---------------------------------------------------------
I rebooted the server and did a firmware update, which did not seem to be
necessary:

flint -d /dev/mst/mt25204_pci_cr0 -i jff202/fw-25204-1_2_0-MHGS18-XTC_A5.bin b



    Current FW version on flash:  1.2.0
    New FW version:               1.2.0

    Note: The new FW version is not newer than the current FW version on flash.

 Do you want to continue ? (y/n) [n] : y

Read and verify Invariant Sector            - OK
Read and verify PPS/SPS on flash            - OK
Burning second FW image without signatures  - OK  
Restoring second signature                  - OK  



Then I did a verify:




root@jff:~# flint -d /dev/mst/mt25204_pci_cr0 v

Failsafe image:

Invariant       /0x00000028-0x00000953 (0x00092c)/ (BOOT2) - OK

Primary   Pointer Sector /0x00010000/ - invalid signature (00000000)

Secondary Image /0x00020000-0x00020107 (0x000108)/ (Pointer Sector)- OK
                /0x00090028-0x0009086f (0x000848)/ (BOOT2) - OK
                /0x00090870-0x000945ff (0x003d90)/ (BOOT2) - OK
                /0x00094600-0x0009515f (0x000b60)/ (Configuration) - OK
                /0x00095160-0x00095193 (0x000034)/ (GUID) - OK
                /0x00095194-0x000951db (0x000048)/ (Image Info) - OK
                /0x000951dc-0x0009525b (0x000080)/ (DDR) - OK
                /0x0009525c-0x000a8e2f (0x013bd4)/ (DDR) - OK
                /0x000a8e30-0x000a8eaf (0x000080)/ (DDR) - OK
                /0x000a8eb0-0x000aaebb (0x00200c)/ (DDR) - OK
                /0x000aaebc-0x000aaf3b (0x000080)/ (DDR) - OK
                /0x000aaf3c-0x000e147b (0x036540)/ (DDR) - OK
                /0x000e147c-0x000e148f (0x000014)/ (Configuration) - OK
                /0x000e1490-0x000e14d3 (0x000044)/ (Jump addresses) - OK
                /0x000e14d4-0x000e16db (0x000208)/ (FW Configuration) - OK

FW image verification succeeded. Image is bootable.

I have now noticed that both the card port and the switch port this card
is connected to show nonzero 'XmtDiscards' (though the counters do not seem
to be increasing):

# Port counters: Lid 1 port 1
PortSelect:......................1
CounterSelect:...................0x0000
SymbolErrors:....................0
LinkRecovers:....................0
LinkDowned:......................0
RcvErrors:.......................0
RcvRemotePhysErrors:.............0
RcvSwRelayErrors:................0
XmtDiscards:.....................2
XmtConstraintErrors:.............0
RcvConstraintErrors:.............0
LinkIntegrityErrors:.............0
ExcBufOverrunErrors:.............0
VL15Dropped:.....................2
XmtData:.........................1043921
RcvData:.........................123107938
XmtPkts:.........................36932
RcvPkts:.........................249752

# Port counters: Lid 4 port 23
PortSelect:......................23
CounterSelect:...................0x0100
SymbolErrors:....................0
LinkRecovers:....................0
LinkDowned:......................0
RcvErrors:.......................0
RcvRemotePhysErrors:.............0
RcvSwRelayErrors:................142
XmtDiscards:.....................199
XmtConstraintErrors:.............0
RcvConstraintErrors:.............0
LinkIntegrityErrors:.............0
ExcBufOverrunErrors:.............0
VL15Dropped:.....................0
XmtData:.........................649631134
RcvData:.........................6429228
XmtPkts:.........................1345694
RcvPkts:.........................231549
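
To check more carefully whether the error counters are really static, two
readings taken some time apart can be compared. Below is a minimal Python
sketch, assuming the "Name:....value" layout shown above; the function names
are my own, and the second reading is invented purely for illustration (it
is not real data from this fabric):

```python
import re

# One counter per line, e.g. "XmtDiscards:.....................199"
COUNTER_RE = re.compile(r"^(?P<name>\w+):\.+(?P<value>\w+)$")

# Error-type counters as they appear in the listings above; the data and
# packet counters are left out because they grow in normal operation.
ERROR_COUNTERS = (
    "SymbolErrors", "LinkRecovers", "LinkDowned", "RcvErrors",
    "RcvRemotePhysErrors", "RcvSwRelayErrors", "XmtDiscards",
    "XmtConstraintErrors", "RcvConstraintErrors", "LinkIntegrityErrors",
    "ExcBufOverrunErrors", "VL15Dropped",
)


def parse_counters(text):
    """Parse counter lines into {name: int}; other lines are ignored."""
    counters = {}
    for line in text.splitlines():
        m = COUNTER_RE.match(line.strip())
        if m is not None:
            # Base 0 accepts both decimal and 0x-prefixed values.
            counters[m.group("name")] = int(m.group("value"), 0)
    return counters


def grown_error_counters(before, after):
    """Return {name: increase} for error counters that went up."""
    deltas = {}
    for name in ERROR_COUNTERS:
        if name in before and name in after and after[name] > before[name]:
            deltas[name] = after[name] - before[name]
    return deltas


# Subset of the Lid 4 port 23 reading above.
first_reading = """\
# Port counters: Lid 4 port 23
PortSelect:......................23
CounterSelect:...................0x0100
RcvSwRelayErrors:................142
XmtDiscards:.....................199
VL15Dropped:.....................0
XmtPkts:.........................1345694
"""

# Hypothetical later reading, made up for this example only.
second_reading = """\
# Port counters: Lid 4 port 23
PortSelect:......................23
CounterSelect:...................0x0100
RcvSwRelayErrors:................142
XmtDiscards:.....................204
VL15Dropped:.....................0
XmtPkts:.........................1350000
"""

deltas = grown_error_counters(
    parse_counters(first_reading), parse_counters(second_reading)
)
for name, delta in deltas.items():
    print(f"{name} grew by {delta}")
```

An empty result between two real readings would support the impression that
the discards are left over from the earlier incident rather than still
accumulating.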

Is this a hardware problem? Is there a way to check for a hardware
problem?

Regards




