[Users] mthca lockup

Rupert Dance rsdance at soft-forge.com
Mon Sep 23 08:59:19 PDT 2013


Can you tell us what version of OFED you are running?

Thanks

Rupert

-----Original Message-----
From: users-bounces at lists.openfabrics.org
[mailto:users-bounces at lists.openfabrics.org] On Behalf Of Orion Poplawski
Sent: Monday, September 23, 2013 11:38 AM
To: Users at lists.openfabrics.org
Subject: [Users] mthca lockup

I'm running Scientific Linux 6.4 and just saw the following:

Sep 21 12:44:38 castor kernel: rpcrdma: connection to 192.168.2.16:2050 closed (-103)
Sep 21 12:45:43 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
Sep 21 12:46:10 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
Sep 21 12:46:58 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
Sep 21 12:47:59 castor kernel: nfs: server saga not responding, timed out
Sep 21 12:48:03 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
Sep 21 12:48:23 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
Sep 21 12:49:08 castor kernel: nfs: server earth.cora.nwra.com not responding, still trying
Sep 21 12:49:28 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
Sep 21 12:49:58 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
Sep 21 12:50:03 castor kernel: Error: state manager failed on NFSv4 server alexandria2ib with error 5
Sep 21 12:50:43 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
Sep 21 12:51:33 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
Sep 21 12:51:48 castor kernel: nfs: server earth.cora.nwra.com not responding, still trying
Sep 21 12:52:08 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
Sep 21 12:52:11 castor kernel: ------------[ cut here ]------------
Sep 21 12:52:11 castor kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Tainted: G          I---------------   )
Sep 21 12:52:11 castor kernel: Hardware name: X7DWT
Sep 21 12:52:11 castor kernel: NETDEV WATCHDOG: ib0 (ib_mthca): transmit queue 0 timed out
Sep 21 12:52:11 castor kernel: Modules linked in: des_generic ecb md4 nls_utf8 cifs xprtrdma nfs lockd fscache nfs_acl autofs4 rpcsec_gss_krb5 auth_rpcgss sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa e1000e radeon ttm drm_kms_helper drm i2c_algo_bit ib_mthca ib_mad ib_core microcode serio_raw i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support i5400_edac edac_core i5k_amb ioatdma dca shpchp ext4 jbd2 mbcache sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Sep 21 12:52:11 castor kernel: Pid: 0, comm: swapper Tainted: G I---------------    2.6.32-358.18.1.el6.x86_64 #1
Sep 21 12:52:11 castor kernel: Call Trace:
Sep 21 12:52:11 castor kernel: <IRQ>  [<ffffffff8106e3e7>] ? warn_slowpath_common+0x87/0xc0
Sep 21 12:52:11 castor kernel: [<ffffffff8106e4d6>] ? warn_slowpath_fmt+0x46/0x50
Sep 21 12:52:11 castor kernel: [<ffffffff81467f8d>] ? dev_watchdog+0x26d/0x280
Sep 21 12:52:11 castor kernel: [<ffffffff81090e00>] ? work_on_cpu+0xb0/0xd0
Sep 21 12:52:11 castor kernel: [<ffffffff810913d1>] ? __queue_work+0x41/0x50
Sep 21 12:52:11 castor kernel: [<ffffffff81467d20>] ? dev_watchdog+0x0/0x280
Sep 21 12:52:11 castor kernel: [<ffffffff81081937>] ? run_timer_softirq+0x197/0x340
Sep 21 12:52:11 castor kernel: [<ffffffff810a8060>] ? tick_sched_timer+0x0/0xc0
Sep 21 12:52:11 castor kernel: [<ffffffff8102ea2d>] ? lapic_next_event+0x1d/0x30
Sep 21 12:52:11 castor kernel: [<ffffffff810770b1>] ? __do_softirq+0xc1/0x1e0
Sep 21 12:52:11 castor kernel: [<ffffffff8109b87b>] ? hrtimer_interrupt+0x14b/0x260
Sep 21 12:52:11 castor kernel: [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
Sep 21 12:52:11 castor kernel: [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
Sep 21 12:52:11 castor kernel: [<ffffffff81076e95>] ? irq_exit+0x85/0x90
Sep 21 12:52:11 castor kernel: [<ffffffff815177d0>] ? smp_apic_timer_interrupt+0x70/0x9b
Sep 21 12:52:11 castor kernel: [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
Sep 21 12:52:11 castor kernel: <EOI>  [<ffffffff81307d27>] ? acpi_idle_enter_simple+0x117/0x14b
Sep 21 12:52:11 castor kernel: [<ffffffff81307d20>] ? acpi_idle_enter_simple+0x110/0x14b
Sep 21 12:52:11 castor kernel: [<ffffffff81307a2f>] ? acpi_idle_enter_bm+0xef/0x2d0
Sep 21 12:52:11 castor kernel: [<ffffffff81416718>] ? menu_select+0x178/0x390
Sep 21 12:52:11 castor kernel: [<ffffffff814155f7>] ? cpuidle_idle_call+0xa7/0x140
Sep 21 12:52:11 castor kernel: [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
Sep 21 12:52:11 castor kernel: [<ffffffff8150756c>] ? start_secondary+0x2ac/0x2ef
Sep 21 12:52:11 castor kernel: ---[ end trace 3f6ffbcc867bdba5 ]---
Sep 21 12:52:11 castor kernel: ib0: transmit timeout: latency 1997 msecs
Sep 21 12:52:11 castor kernel: ib0: queue stopped 1, tx_head 2265490, tx_tail 2265362
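
The numeric codes in the log above can be decoded mechanically. What follows is only a minimal sketch, assuming the standard Linux errno numbering and the ordering of enum ib_qp_state in the kernel's include/rdma/ib_verbs.h; the helper names decode_errno and decode_qp_transition are made up for illustration and are not part of any tool.

```python
# Linux errno values, hardcoded (assumption: standard Linux numbering) so
# the sketch gives the same answer regardless of the platform it runs on.
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    16: ("EBUSY", "Device or resource busy"),
    103: ("ECONNABORTED", "Software caused connection abort"),
}

# Queue pair states in the order of enum ib_qp_state (ib_verbs.h).
IB_QP_STATE = ["RESET", "INIT", "RTR", "RTS", "SQD", "SQE", "ERR"]


def decode_errno(code):
    """Map a (possibly negative) kernel return code to its errno name."""
    name, text = LINUX_ERRNO.get(abs(code), ("UNKNOWN", "not in this table"))
    return f"{name} ({text})"


def decode_qp_transition(transition):
    """Turn a string such as '3->6' into the named QP state transition."""
    old, new = (int(part) for part in transition.split("->"))
    return f"{IB_QP_STATE[old]} -> {IB_QP_STATE[new]}"


if __name__ == "__main__":
    # Codes taken from the log: rpcrdma close, modify QP result, NFSv4 error.
    for code in (-103, -16, 5):
        print(code, decode_errno(code))
    print("modify QP 3->6:", decode_qp_transition("3->6"))
```

Read this way, the repeated "modify QP 3->6 returned -16" lines would mean that moving a queue pair from RTS to the error state kept failing with EBUSY.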

# ibstat
CA 'mthca0'
         CA type: MT25204
         Number of ports: 1
         Firmware version: 1.2.0
         Hardware version: a0
         Node GUID: 0x0005ad00000c593c
         System image GUID: 0x0005ad00000c593f
         Port 1:
                 State: Active
                 Physical state: LinkUp
                 Rate: 20
                 Base lid: 8
                 LMC: 0
                 SM lid: 1
                 Capability mask: 0x02510a68
                 Port GUID: 0x0005ad00000c593d
                 Link layer: InfiniBand

kernel-2.6.32-358.18.1.el6.x86_64

Anyone seen this before?

--
Orion Poplawski
Technical Manager                     303-415-9701 x222
NWRA, Boulder/CoRA Office             FAX: 303-415-9702
3380 Mitchell Lane                       orion at nwra.com
Boulder, CO 80301                   http://www.nwra.com




