[Users] mthca lockup

Orion Poplawski orion at cora.nwra.com
Mon Sep 23 08:37:37 PDT 2013


I'm running Scientific Linux 6.4 and just saw the following in the kernel log:

Sep 21 12:44:38 castor kernel: rpcrdma: connection to 192.168.2.16:2050 closed (-103)
Sep 21 12:45:43 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
Sep 21 12:46:10 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
Sep 21 12:46:58 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
Sep 21 12:47:59 castor kernel: nfs: server saga not responding, timed out
Sep 21 12:48:03 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
Sep 21 12:48:23 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
Sep 21 12:49:08 castor kernel: nfs: server earth.cora.nwra.com not responding, still trying
Sep 21 12:49:28 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
Sep 21 12:49:58 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
Sep 21 12:50:03 castor kernel: Error: state manager failed on NFSv4 server alexandria2ib with error 5
Sep 21 12:50:43 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
Sep 21 12:51:33 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
Sep 21 12:51:48 castor kernel: nfs: server earth.cora.nwra.com not responding, still trying
Sep 21 12:52:08 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
Sep 21 12:52:11 castor kernel: ------------[ cut here ]------------
Sep 21 12:52:11 castor kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Tainted: G          I---------------   )
Sep 21 12:52:11 castor kernel: Hardware name: X7DWT
Sep 21 12:52:11 castor kernel: NETDEV WATCHDOG: ib0 (ib_mthca): transmit queue 0 timed out
Sep 21 12:52:11 castor kernel: Modules linked in: des_generic ecb md4 nls_utf8 cifs xprtrdma nfs lockd fscache nfs_acl autofs4 rpcsec_gss_krb5 auth_rpcgss sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa e1000e radeon ttm drm_kms_helper drm i2c_algo_bit ib_mthca ib_mad ib_core microcode serio_raw i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support i5400_edac edac_core i5k_amb ioatdma dca shpchp ext4 jbd2 mbcache sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Sep 21 12:52:11 castor kernel: Pid: 0, comm: swapper Tainted: G          I---------------    2.6.32-358.18.1.el6.x86_64 #1
Sep 21 12:52:11 castor kernel: Call Trace:
Sep 21 12:52:11 castor kernel: <IRQ>  [<ffffffff8106e3e7>] ? warn_slowpath_common+0x87/0xc0
Sep 21 12:52:11 castor kernel: [<ffffffff8106e4d6>] ? warn_slowpath_fmt+0x46/0x50
Sep 21 12:52:11 castor kernel: [<ffffffff81467f8d>] ? dev_watchdog+0x26d/0x280
Sep 21 12:52:11 castor kernel: [<ffffffff81090e00>] ? work_on_cpu+0xb0/0xd0
Sep 21 12:52:11 castor kernel: [<ffffffff810913d1>] ? __queue_work+0x41/0x50
Sep 21 12:52:11 castor kernel: [<ffffffff81467d20>] ? dev_watchdog+0x0/0x280
Sep 21 12:52:11 castor kernel: [<ffffffff81081937>] ? run_timer_softirq+0x197/0x340
Sep 21 12:52:11 castor kernel: [<ffffffff810a8060>] ? tick_sched_timer+0x0/0xc0
Sep 21 12:52:11 castor kernel: [<ffffffff8102ea2d>] ? lapic_next_event+0x1d/0x30
Sep 21 12:52:11 castor kernel: [<ffffffff810770b1>] ? __do_softirq+0xc1/0x1e0
Sep 21 12:52:11 castor kernel: [<ffffffff8109b87b>] ? hrtimer_interrupt+0x14b/0x260
Sep 21 12:52:11 castor kernel: [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
Sep 21 12:52:11 castor kernel: [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
Sep 21 12:52:11 castor kernel: [<ffffffff81076e95>] ? irq_exit+0x85/0x90
Sep 21 12:52:11 castor kernel: [<ffffffff815177d0>] ? smp_apic_timer_interrupt+0x70/0x9b
Sep 21 12:52:11 castor kernel: [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
Sep 21 12:52:11 castor kernel: <EOI>  [<ffffffff81307d27>] ? acpi_idle_enter_simple+0x117/0x14b
Sep 21 12:52:11 castor kernel: [<ffffffff81307d20>] ? acpi_idle_enter_simple+0x110/0x14b
Sep 21 12:52:11 castor kernel: [<ffffffff81307a2f>] ? acpi_idle_enter_bm+0xef/0x2d0
Sep 21 12:52:11 castor kernel: [<ffffffff81416718>] ? menu_select+0x178/0x390
Sep 21 12:52:11 castor kernel: [<ffffffff814155f7>] ? cpuidle_idle_call+0xa7/0x140
Sep 21 12:52:11 castor kernel: [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
Sep 21 12:52:11 castor kernel: [<ffffffff8150756c>] ? start_secondary+0x2ac/0x2ef
Sep 21 12:52:11 castor kernel: ---[ end trace 3f6ffbcc867bdba5 ]---
Sep 21 12:52:11 castor kernel: ib0: transmit timeout: latency 1997 msecs
Sep 21 12:52:11 castor kernel: ib0: queue stopped 1, tx_head 2265490, tx_tail 2265362
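
For anyone reading the log above, here is a small Python sketch that decodes the numeric codes and the IPoIB counters. It assumes the negative numbers are standard Linux errno values, which is the usual kernel convention but is not confirmed for every message here; the "error 5" in the NFSv4 state manager line is probably also an errno. The errno table is hardcoded from the Linux headers, and the helper names `decode` and `outstanding_sends` are made up for this illustration.

```python
import re

# Linux errno values (asm-generic/errno-base.h and errno.h), hardcoded so
# the result does not depend on the platform running this script.
LINUX_ERRNO = {5: "EIO", 16: "EBUSY", 103: "ECONNABORTED"}


def decode(code):
    """Return the symbolic name for a (possibly negative) kernel return code."""
    return LINUX_ERRNO.get(abs(code), "unknown")


def outstanding_sends(line):
    """Return tx_head - tx_tail from an IPoIB 'queue stopped' line, or None."""
    m = re.search(r"tx_head (\d+), tx_tail (\d+)", line)
    return int(m.group(1)) - int(m.group(2)) if m else None


if __name__ == "__main__":
    # rpcrdma close code, modify QP return code, NFSv4 state manager error
    print(decode(-103), decode(-16), decode(5))
    print(outstanding_sends(
        "ib0: queue stopped 1, tx_head 2265490, tx_tail 2265362"))
```

Read this way, the rpcrdma connection was aborted (ECONNABORTED), each "modify QP 3->6" attempt came back busy (EBUSY), and ib0 had 128 sends posted but not completed when the watchdog fired. That would be consistent with a full transmit ring if the IPoIB send queue size is 128, though I have not checked that setting on this machine.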

# ibstat
CA 'mthca0'
         CA type: MT25204
         Number of ports: 1
         Firmware version: 1.2.0
         Hardware version: a0
         Node GUID: 0x0005ad00000c593c
         System image GUID: 0x0005ad00000c593f
         Port 1:
                 State: Active
                 Physical state: LinkUp
                 Rate: 20
                 Base lid: 8
                 LMC: 0
                 SM lid: 1
                 Capability mask: 0x02510a68
                 Port GUID: 0x0005ad00000c593d
                 Link layer: InfiniBand

kernel-2.6.32-358.18.1.el6.x86_64

Anyone seen this before?

-- 
Orion Poplawski
Technical Manager                     303-415-9701 x222
NWRA, Boulder/CoRA Office             FAX: 303-415-9702
3380 Mitchell Lane                       orion at nwra.com
Boulder, CO 80301                   http://www.nwra.com


