[ofa-general] ib_mthca Catastrophic errors
Pawel Dziekonski
dzieko at wcss.pl
Fri Jun 5 05:40:33 PDT 2009
Hi,
From time to time I get catastrophic errors like the ones below. The software
stack is kernel 2.6.18-92.1.10.el5 with the Lustre client. Device and OFED
information is also below.
Any hints?
Thanks in advance, Pawel
06:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 20)
# ibv_devices
    device              node GUID
    ------              ----------------
    mthca0              0030487e07700000
# ibv_devinfo
hca_id: mthca0
        fw_ver:                 1.2.0
        node_guid:              0030:487e:0770:0000
        sys_image_guid:         0030:487e:0770:0003
        vendor_id:              0x02c9
        vendor_part_id:         25204
        hw_ver:                 0xA0
        board_id:               SM_0000000003
        phys_port_cnt:          1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         1
                        port_lid:       441
                        port_lmc:       0x00
# ofed_info
OFED-1.3.1
libibverbs:
git://git.openfabrics.org/ofed_1_3/libibverbs.git ofed_1_3
commit 40b771aa6a9c0ad092b2e20775b4723d3b173792
libmthca:
git://git.openfabrics.org/ofed_1_3/libmthca.git ofed_1_3
commit 9501e698d257949acfab2edc90812602966dbcc9
libmlx4:
git://git.openfabrics.org/ofed_1_3/libmlx4.git ofed_1_3
commit 3869d6dab7e12fe452270ca641f7dd7082b42482
libehca:
git://git.openfabrics.org/ofed_1_3/libehca.git ofed_1_3
commit fd898180cfa3b737f893f432a80b91bac3396325
libipathverbs:
git://git.openfabrics.org/ofed_1_3/libipathverbs.git ofed_1_3
commit 82be4d81859d1fd2edf830220fe65a9923b80a46
libcxgb3:
git://git.openfabrics.org/ofed_1_3/libcxgb3.git ofed_1_3
commit 6f7485feb244d8571fcab2292ef92c97bea48df0
libnes:
git://git.openfabrics.org/ofed_1_3/libnes.git ofed_1_3
commit 471fa2e5a7bb2f8946119396358c31adcc6c2fb3
libibcm:
git://git.openfabrics.org/ofed_1_3/libibcm.git ofed_1_3
commit 53ec35f544bbc1838bbadc2210909c25a954a5e2
librdmacm:
git://git.openfabrics.org/ofed_1_3/librdmacm.git ofed_1_3
commit a0ef80a1e0d5debdae48a844fbc8d09aec5b24b1
dapl1:
git://git.openfabrics.org/ofed_1_3/dapl1.git ofed_1_3
commit 7a9b58d6c50fc0a357de540ec3eb2ab2e07f8779
dapl2:
git://git.openfabrics.org/ofed_1_3/dapl2.git ofed_1_3
commit 2583f07d9d0f55eee14e0b0e6074bc6fd0712177
libsdp:
git://git.openfabrics.org/ofed_1_3/libsdp.git ofed_1_3
commit c8102dccc502930442b23de658674d386456b350
sdpnetstat:
git://git.openfabrics.org/ofed_1_3/sdpnetstat.git ofed_1_3
commit 3341620a7259c4f7bdd4180864b98e260c3dc223
srptools:
git://git.openfabrics.org/ofed_1_3/srptools.git ofed_1_3
commit e0ce2d42eeb25f8e89b8f6daaa32a630c9b64f0d
perftest:
git://git.openfabrics.org/ofed_1_3/perftest.git ofed_1_3
commit 6321b5468f7293088cc003809049c02b176130d8
qlvnictools:
git://git.openfabrics.org/ofed_1_3/qlvnictools.git ofed_1_3
commit 086f9cb80ee790d61bddaf201ecbae32a2ff21dd
tvflash:
git://git.openfabrics.org/ofed_1_3/tvflash.git ofed_1_3
commit f5e7407a7f2058448df5e5320d9843f944427429
mstflint:
git://git.openfabrics.org/ofed_1_3/mstflint.git ofed_1_3
commit 78bbd3d521a9078553a991111ffb6f76665b9ee9
qperf:
git://git.openfabrics.org/ofed_1_3/qperf.git ofed_1_3
commit 6221aabd038df0b7033e035378ca190641ed2295
management:
git://git.openfabrics.org/ofed_1_3/management.git ofed_1_3
commit d9c852406dae14e8284f9cfb1c7f495bbb55fddf
ibutils:
git://git.openfabrics.org/ofed_1_3/ibutils.git ofed_1_3
commit 7daf94fab6eaf307316326f3f49704e6080a1508
ibsim:
git://git.openfabrics.org/ofed_1_3/ibsim.git ofed_1_3
commit 55113d9f919709c7c97ea41d29991941b9c8be70
ofa_kernel-1.3.1:
Git:
git://git.openfabrics.org/ofed_1_3/linux-2.6.git ofed_kernel
commit 39e1dc833f98e5134f91fcf7f33df402adf4bc0c
# MPI
mvapich-1.0.1-2533.src.rpm
mvapich2-1.0.3-1.src.rpm
openmpi-1.2.6-1.src.rpm
mpitests-3.0-773.src.rpm
kernel: ib_mthca 0000:06:00.0: Catastrophic error detected: unknown error
kernel: ib_mthca 0000:06:00.0: buf[00]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[01]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[02]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[03]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[04]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[05]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[06]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[07]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[08]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[09]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[0a]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[0b]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[0c]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[0d]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[0e]: ffffffff
kernel: ib_mthca 0000:06:00.0: buf[0f]: ffffffff
kernel: ib_mthca 0000:06:00.0: HW2SW_MPT failed (-11)
kernel: ib_mthca 0000:06:00.0: HW2SW_MPT failed (-11)
kernel: ib0: ib_detach_mcast failed (result = -11)
kernel: ib0: ipoib_mcast_detach failed (result = -11)
kernel: ib0: ib_detach_mcast failed (result = -11)
kernel: ib0: ipoib_mcast_detach failed (result = -11)
kernel: ib0: Failed to modify QP to ERROR state
kernel: ib0: timing out; 0 sends 128 receives not completed
kernel: ib0: Failed to modify QP to RESET state
kernel: ib_mthca 0000:06:00.0: HW2SW_MPT failed (-11)
kernel: ib_mthca 0000:06:00.0: HW2SW_CQ failed (-11)
kernel: ib_mthca 0000:06:00.0: HW2SW_MPT failed (-11)
kernel: ib_mthca 0000:06:00.0: HW2SW_SRQ failed (-11)
kernel: ib_mthca 0000:06:00.0: HW2SW_MPT failed (-11)
kernel: ib_mthca 0000:01:00.0: Catastrophic error detected: internal parity error
kernel: ib_mthca 0000:01:00.0: buf[00]: 05000000
kernel: ib_mthca 0000:01:00.0: buf[01]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[02]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[03]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[04]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[05]: 00127f2c
kernel: ib_mthca 0000:01:00.0: buf[06]: 000a0056
kernel: ib_mthca 0000:01:00.0: buf[07]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[08]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[09]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[0a]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[0b]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[0c]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[0d]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[0e]: 00000000
kernel: ib_mthca 0000:01:00.0: buf[0f]: 00000000
kernel: ib0: ib_query_port failed
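
For anyone comparing many such dumps across nodes, here is a minimal Python sketch that groups the log lines above into one record per catastrophic error. The regular expressions are derived only from the lines quoted in this message, and the function names (summarize_catas, all_ones) are hypothetical helpers, not part of OFED or the mthca driver. The all-ones check only classifies a dump by its contents, like the first one above; it does not diagnose a cause.

```python
import re

# Patterns inferred from the log lines quoted in this message (an assumption,
# not an official description of the ib_mthca log format).
CATAS_RE = re.compile(r"ib_mthca (\S+): Catastrophic error detected: (.+)$")
BUF_RE = re.compile(r"ib_mthca (\S+): buf\[([0-9a-f]{2})\]: ([0-9a-f]{8})$")


def summarize_catas(lines):
    """Return one dict per catastrophic error: device, reason, buf words."""
    events = []
    latest = {}  # PCI address -> most recent event seen for that device
    for line in lines:
        line = line.strip()
        m = CATAS_RE.search(line)
        if m:
            event = {"device": m.group(1), "reason": m.group(2), "buf": {}}
            events.append(event)
            latest[m.group(1)] = event
            continue
        m = BUF_RE.search(line)
        if m and m.group(1) in latest:
            # Store each dumped word under its index, both parsed from hex.
            latest[m.group(1)]["buf"][int(m.group(2), 16)] = int(m.group(3), 16)
    return events


def all_ones(event):
    """True if the event has dumped words and every one is 0xffffffff."""
    words = event["buf"].values()
    return bool(event["buf"]) and all(w == 0xFFFFFFFF for w in words)
```

A possible use is to feed it the lines of a syslog file and print the device, the reason, and the result of all_ones for each event, so that dumps from different HCAs can be compared side by side.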
--
Pawel Dziekonski <pawel.dziekonski at wcss.pl>
Wroclaw Centre for Networking & Supercomputing, HPC Department
Politechnika Wr., pl. Grunwaldzki 9, bud. D2/101, 50-377 Wroclaw, POLAND
phone: +48 71 3202043, fax: +48 71 3225797, http://www.wcss.wroc.pl