[Users] mthca lockup
Orion Poplawski
orion at cora.nwra.com
Mon Sep 23 08:37:37 PDT 2013
I'm running Scientific Linux 6.4 and just saw the following:
Sep 21 12:44:38 castor kernel: rpcrdma: connection to 192.168.2.16:2050 closed (-103)
Sep 21 12:45:43 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
Sep 21 12:46:10 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
Sep 21 12:46:58 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
Sep 21 12:47:59 castor kernel: nfs: server saga not responding, timed out
Sep 21 12:48:03 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
Sep 21 12:48:23 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
Sep 21 12:49:08 castor kernel: nfs: server earth.cora.nwra.com not responding, still trying
Sep 21 12:49:28 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
Sep 21 12:49:58 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
Sep 21 12:50:03 castor kernel: Error: state manager failed on NFSv4 server alexandria2ib with error 5
Sep 21 12:50:43 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
Sep 21 12:51:33 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
Sep 21 12:51:48 castor kernel: nfs: server earth.cora.nwra.com not responding, still trying
Sep 21 12:52:08 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
Sep 21 12:52:11 castor kernel: ------------[ cut here ]------------
Sep 21 12:52:11 castor kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Tainted: G I--------------- )
Sep 21 12:52:11 castor kernel: Hardware name: X7DWT
Sep 21 12:52:11 castor kernel: NETDEV WATCHDOG: ib0 (ib_mthca): transmit queue 0 timed out
Sep 21 12:52:11 castor kernel: Modules linked in: des_generic ecb md4 nls_utf8 cifs xprtrdma nfs lockd fscache nfs_acl autofs4 rpcsec_gss_krb5 auth_rpcgss sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa e1000e radeon ttm drm_kms_helper drm i2c_algo_bit ib_mthca ib_mad ib_core microcode serio_raw i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support i5400_edac edac_core i5k_amb ioatdma dca shpchp ext4 jbd2 mbcache sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Sep 21 12:52:11 castor kernel: Pid: 0, comm: swapper Tainted: G I--------------- 2.6.32-358.18.1.el6.x86_64 #1
Sep 21 12:52:11 castor kernel: Call Trace:
Sep 21 12:52:11 castor kernel: <IRQ> [<ffffffff8106e3e7>] ? warn_slowpath_common+0x87/0xc0
Sep 21 12:52:11 castor kernel: [<ffffffff8106e4d6>] ? warn_slowpath_fmt+0x46/0x50
Sep 21 12:52:11 castor kernel: [<ffffffff81467f8d>] ? dev_watchdog+0x26d/0x280
Sep 21 12:52:11 castor kernel: [<ffffffff81090e00>] ? work_on_cpu+0xb0/0xd0
Sep 21 12:52:11 castor kernel: [<ffffffff810913d1>] ? __queue_work+0x41/0x50
Sep 21 12:52:11 castor kernel: [<ffffffff81467d20>] ? dev_watchdog+0x0/0x280
Sep 21 12:52:11 castor kernel: [<ffffffff81081937>] ? run_timer_softirq+0x197/0x340
Sep 21 12:52:11 castor kernel: [<ffffffff810a8060>] ? tick_sched_timer+0x0/0xc0
Sep 21 12:52:11 castor kernel: [<ffffffff8102ea2d>] ? lapic_next_event+0x1d/0x30
Sep 21 12:52:11 castor kernel: [<ffffffff810770b1>] ? __do_softirq+0xc1/0x1e0
Sep 21 12:52:11 castor kernel: [<ffffffff8109b87b>] ? hrtimer_interrupt+0x14b/0x260
Sep 21 12:52:11 castor kernel: [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
Sep 21 12:52:11 castor kernel: [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
Sep 21 12:52:11 castor kernel: [<ffffffff81076e95>] ? irq_exit+0x85/0x90
Sep 21 12:52:11 castor kernel: [<ffffffff815177d0>] ? smp_apic_timer_interrupt+0x70/0x9b
Sep 21 12:52:11 castor kernel: [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
Sep 21 12:52:11 castor kernel: <EOI> [<ffffffff81307d27>] ? acpi_idle_enter_simple+0x117/0x14b
Sep 21 12:52:11 castor kernel: [<ffffffff81307d20>] ? acpi_idle_enter_simple+0x110/0x14b
Sep 21 12:52:11 castor kernel: [<ffffffff81307a2f>] ? acpi_idle_enter_bm+0xef/0x2d0
Sep 21 12:52:11 castor kernel: [<ffffffff81416718>] ? menu_select+0x178/0x390
Sep 21 12:52:11 castor kernel: [<ffffffff814155f7>] ? cpuidle_idle_call+0xa7/0x140
Sep 21 12:52:11 castor kernel: [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
Sep 21 12:52:11 castor kernel: [<ffffffff8150756c>] ? start_secondary+0x2ac/0x2ef
Sep 21 12:52:11 castor kernel: ---[ end trace 3f6ffbcc867bdba5 ]---
Sep 21 12:52:11 castor kernel: ib0: transmit timeout: latency 1997 msecs
Sep 21 12:52:11 castor kernel: ib0: queue stopped 1, tx_head 2265490, tx_tail 2265362
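For anyone reading the log above, here is a minimal sketch for decoding the numeric codes it contains. It assumes the standard Linux errno numbering (5, 16, 103) and the usual kernel QP state ordering (`enum ib_qp_state`, where 3 is RTS and 6 is ERR); the tables and the helper names `decode_errno`, `decode_qp_transition` and `tx_outstanding` are illustrative and not part of any tool mentioned in this post.

```python
import re

# Linux errno values that appear in the log (standard numbering, assumed to
# apply to this x86_64 kernel). Only the codes seen above are listed.
LINUX_ERRNO = {
    5: ("EIO", "I/O error"),
    16: ("EBUSY", "Device or resource busy"),
    103: ("ECONNABORTED", "Software caused connection abort"),
}

# QP states in the order of the kernel's enum ib_qp_state (assumed).
QP_STATES = ["RESET", "INIT", "RTR", "RTS", "SQD", "SQE", "ERR"]


def decode_errno(code):
    """Return 'NAME (description)' for a return code such as -16."""
    name, text = LINUX_ERRNO.get(abs(code), ("UNKNOWN", "not in table"))
    return f"{name} ({text})"


def decode_qp_transition(line):
    """Translate 'modify QP 3->6 returned -16.' into state and errno names."""
    m = re.search(r"modify QP (\d+)->(\d+) returned (-?\d+)", line)
    if m is None:
        raise ValueError("not a 'modify QP' line")
    cur, new, err = (int(g) for g in m.groups())
    return f"{QP_STATES[cur]}->{QP_STATES[new]}: {decode_errno(err)}"


def tx_outstanding(line):
    """Return tx_head - tx_tail from an IPoIB 'queue stopped' line."""
    m = re.search(r"tx_head (\d+), tx_tail (\d+)", line)
    if m is None:
        raise ValueError("no tx_head/tx_tail in line")
    return int(m.group(1)) - int(m.group(2))


if __name__ == "__main__":
    print(decode_errno(-103))
    print(decode_qp_transition("ib_mthca 0000:01:00.0: modify QP 3->6 returned -16."))
    print(tx_outstanding("ib0: queue stopped 1, tx_head 2265490, tx_tail 2265362"))
```

Read this way, the repeated mthca message is a failed attempt to move a QP from RTS to the error state, and the last line shows 128 sends posted on ib0 that had not completed when the watchdog fired. Whether 128 is the configured send queue size on this host is not something the log states.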
# ibstat
CA 'mthca0'
	CA type: MT25204
	Number of ports: 1
	Firmware version: 1.2.0
	Hardware version: a0
	Node GUID: 0x0005ad00000c593c
	System image GUID: 0x0005ad00000c593f
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 20
		Base lid: 8
		LMC: 0
		SM lid: 1
		Capability mask: 0x02510a68
		Port GUID: 0x0005ad00000c593d
		Link layer: InfiniBand
kernel-2.6.32-358.18.1.el6.x86_64
Anyone seen this before?
--
Orion Poplawski
Technical Manager 303-415-9701 x222
NWRA, Boulder/CoRA Office FAX: 303-415-9702
3380 Mitchell Lane orion at nwra.com
Boulder, CO 80301 http://www.nwra.com