[Users] mthca lockup

Orion Poplawski orion at cora.nwra.com
Mon Sep 23 09:23:17 PDT 2013


On 09/23/2013 09:59 AM, Rupert Dance wrote:
> Can you tell us what version of OFED you are running?

Not sure how to tell.  Whatever comes with RHEL/SL 6.4.  Does this help?

libibumad-1.3.8-1.el6.x86_64
libibmad-1.3.9-1.el6.x86_64
libibverbs-1.1.6-5.el6.x86_64
libmthca-1.0.6-3.el6.x86_64


> -----Original Message-----
> From: users-bounces at lists.openfabrics.org
> [mailto:users-bounces at lists.openfabrics.org] On Behalf Of Orion Poplawski
> Sent: Monday, September 23, 2013 11:38 AM
> To: Users at lists.openfabrics.org
> Subject: [Users] mthca lockup
>
> I'm running Scientific Linux 6.4 and just saw the following:
>
> Sep 21 12:44:38 castor kernel: rpcrdma: connection to 192.168.2.16:2050 closed (-103)
> Sep 21 12:45:43 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
> Sep 21 12:46:10 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
> Sep 21 12:46:58 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
> Sep 21 12:47:59 castor kernel: nfs: server saga not responding, timed out
> Sep 21 12:48:03 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
> Sep 21 12:48:23 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
> Sep 21 12:49:08 castor kernel: nfs: server earth.cora.nwra.com not responding, still trying
> Sep 21 12:49:28 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
> Sep 21 12:49:58 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
> Sep 21 12:50:03 castor kernel: Error: state manager failed on NFSv4 server alexandria2ib with error 5
> Sep 21 12:50:43 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
> Sep 21 12:51:33 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
> Sep 21 12:51:48 castor kernel: nfs: server earth.cora.nwra.com not responding, still trying
> Sep 21 12:52:08 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
> Sep 21 12:52:11 castor kernel: ------------[ cut here ]------------
> Sep 21 12:52:11 castor kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Tainted: G          I---------------   )
> Sep 21 12:52:11 castor kernel: Hardware name: X7DWT
> Sep 21 12:52:11 castor kernel: NETDEV WATCHDOG: ib0 (ib_mthca): transmit queue 0 timed out
> Sep 21 12:52:11 castor kernel: Modules linked in: des_generic ecb md4 nls_utf8 cifs xprtrdma nfs lockd fscache nfs_acl autofs4 rpcsec_gss_krb5 auth_rpcgss sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa e1000e radeon ttm drm_kms_helper drm i2c_algo_bit ib_mthca ib_mad ib_core microcode serio_raw i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support i5400_edac edac_core i5k_amb ioatdma dca shpchp ext4 jbd2 mbcache sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
> Sep 21 12:52:11 castor kernel: Pid: 0, comm: swapper Tainted: G          I---------------    2.6.32-358.18.1.el6.x86_64 #1
> Sep 21 12:52:11 castor kernel: Call Trace:
> Sep 21 12:52:11 castor kernel: <IRQ>  [<ffffffff8106e3e7>] ? warn_slowpath_common+0x87/0xc0
> Sep 21 12:52:11 castor kernel: [<ffffffff8106e4d6>] ? warn_slowpath_fmt+0x46/0x50
> Sep 21 12:52:11 castor kernel: [<ffffffff81467f8d>] ? dev_watchdog+0x26d/0x280
> Sep 21 12:52:11 castor kernel: [<ffffffff81090e00>] ? work_on_cpu+0xb0/0xd0
> Sep 21 12:52:11 castor kernel: [<ffffffff810913d1>] ? __queue_work+0x41/0x50
> Sep 21 12:52:11 castor kernel: [<ffffffff81467d20>] ? dev_watchdog+0x0/0x280
> Sep 21 12:52:11 castor kernel: [<ffffffff81081937>] ? run_timer_softirq+0x197/0x340
> Sep 21 12:52:11 castor kernel: [<ffffffff810a8060>] ? tick_sched_timer+0x0/0xc0
> Sep 21 12:52:11 castor kernel: [<ffffffff8102ea2d>] ? lapic_next_event+0x1d/0x30
> Sep 21 12:52:11 castor kernel: [<ffffffff810770b1>] ? __do_softirq+0xc1/0x1e0
> Sep 21 12:52:11 castor kernel: [<ffffffff8109b87b>] ? hrtimer_interrupt+0x14b/0x260
> Sep 21 12:52:11 castor kernel: [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
> Sep 21 12:52:11 castor kernel: [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
> Sep 21 12:52:11 castor kernel: [<ffffffff81076e95>] ? irq_exit+0x85/0x90
> Sep 21 12:52:11 castor kernel: [<ffffffff815177d0>] ? smp_apic_timer_interrupt+0x70/0x9b
> Sep 21 12:52:11 castor kernel: [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
> Sep 21 12:52:11 castor kernel: <EOI>  [<ffffffff81307d27>] ? acpi_idle_enter_simple+0x117/0x14b
> Sep 21 12:52:11 castor kernel: [<ffffffff81307d20>] ? acpi_idle_enter_simple+0x110/0x14b
> Sep 21 12:52:11 castor kernel: [<ffffffff81307a2f>] ? acpi_idle_enter_bm+0xef/0x2d0
> Sep 21 12:52:11 castor kernel: [<ffffffff81416718>] ? menu_select+0x178/0x390
> Sep 21 12:52:11 castor kernel: [<ffffffff814155f7>] ? cpuidle_idle_call+0xa7/0x140
> Sep 21 12:52:11 castor kernel: [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
> Sep 21 12:52:11 castor kernel: [<ffffffff8150756c>] ? start_secondary+0x2ac/0x2ef
> Sep 21 12:52:11 castor kernel: ---[ end trace 3f6ffbcc867bdba5 ]---
> Sep 21 12:52:11 castor kernel: ib0: transmit timeout: latency 1997 msecs
> Sep 21 12:52:11 castor kernel: ib0: queue stopped 1, tx_head 2265490, tx_tail 2265362
>
> # ibstat
> CA 'mthca0'
>           CA type: MT25204
>           Number of ports: 1
>           Firmware version: 1.2.0
>           Hardware version: a0
>           Node GUID: 0x0005ad00000c593c
>           System image GUID: 0x0005ad00000c593f
>           Port 1:
>                   State: Active
>                   Physical state: LinkUp
>                   Rate: 20
>                   Base lid: 8
>                   LMC: 0
>                   SM lid: 1
>                   Capability mask: 0x02510a68
>                   Port GUID: 0x0005ad00000c593d
>                   Link layer: InfiniBand
>
> kernel-2.6.32-358.18.1.el6.x86_64
>
> Anyone seen this before?
>
> --
> Orion Poplawski
> Technical Manager                     303-415-9701 x222
> NWRA, Boulder/CoRA Office             FAX: 303-415-9702
> 3380 Mitchell Lane                       orion at nwra.com
> Boulder, CO 80301                   http://www.nwra.com
>
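
For anyone reading the quoted log, here is a rough sketch of how the repeated "modify QP" failures could be counted and the numeric codes decoded. Assumptions, none of which come from the driver source: the two numbers in "modify QP 3->6" follow the order of the kernel's enum ib_qp_state (so 3 is RTS and 6 is ERR), and the return values are negated Linux errno numbers (16 = EBUSY, 103 = ECONNABORTED). The names summarize, QP_STATES and ERRNO_NAMES are made up for this example. Lines are joined before matching so that entries split by mail wrapping still match.

```python
import re
from collections import Counter

# Assumed mapping: order of the kernel's enum ib_qp_state.
QP_STATES = {0: "RESET", 1: "INIT", 2: "RTR", 3: "RTS", 4: "SQD", 5: "SQE", 6: "ERR"}

# Linux errno names for the codes seen in the log, hard-coded so the
# result does not depend on the platform running the script.
ERRNO_NAMES = {16: "EBUSY", 103: "ECONNABORTED"}

MODIFY_QP = re.compile(
    r"ib_mthca (\S+): modify QP (\d+)->(\d+)\s+returned (-?\d+)"
)


def summarize(log_text):
    """Count failed modify-QP calls per (device, from, to, errno name)."""
    # Strip mail quote markers and join everything into one line, so an
    # entry wrapped across two lines is matched as a single message.
    flat = " ".join(line.lstrip("> ").strip() for line in log_text.splitlines())
    counts = Counter()
    for dev, src, dst, rc in MODIFY_QP.findall(flat):
        code = abs(int(rc))
        key = (
            dev,
            QP_STATES.get(int(src), src),
            QP_STATES.get(int(dst), dst),
            ERRNO_NAMES.get(code, str(code)),
        )
        counts[key] += 1
    return counts
```

Feeding it the saved log text, e.g. summarize(open("messages").read()) where "messages" is a hypothetical file name, returns a Counter keyed by device, state transition and error name. Under the assumptions above, each "modify QP 3->6 returned -16" entry would be read as an RTS to ERR transition failing with EBUSY.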


-- 
Orion Poplawski
Technical Manager                     303-415-9701 x222
NWRA, Boulder/CoRA Office             FAX: 303-415-9702
3380 Mitchell Lane                       orion at nwra.com
Boulder, CO 80301                   http://www.nwra.com


