[Users] mthca lockup

Rupert Dance rsdance at soft-forge.com
Mon Sep 23 11:00:13 PDT 2013


When OFA software is installed from the OFED distribution, it includes a
utility called "ofed_info", which prints a lot of data about what was
installed. Running "ofed_info -s" is simpler and prints just the version.
The packaging from the various Distros may differ slightly.
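
A rough sketch of that check, for what it is worth. The function name
report_ofed_version is made up for illustration, and the fallback to "rpm -q"
is an assumption for Distro installs that do not ship ofed_info; the package
names are simply the ones listed further down in this thread.

```sh
#!/bin/sh
# Hypothetical helper: print the OFED version if ofed_info is available.
# Assumption: without ofed_info, the stack came from Distro RPMs, so we
# query a few InfiniBand userspace packages instead.
report_ofed_version() {
    if command -v ofed_info >/dev/null 2>&1; then
        # -s prints just the version string
        ofed_info -s
    elif command -v rpm >/dev/null 2>&1; then
        # rpm -q reports each package's version, or says it is not installed
        rpm -q libibverbs libibumad libibmad libmthca
    else
        echo "neither ofed_info nor rpm found" >&2
        return 1
    fi
}
```

Source the file and run "report_ofed_version" on the node in question.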

The reason I asked about the version is that OFED 3.5-2 includes an updated
version of the mthca module, so I was curious whether this could be related.
If you want to try the latest build from the OFA, you can find it at the link
below, but be aware that the Distro version of the OFA software can conflict
with OFED itself. So try to remove the Distro's existing OFA software before
you install the 3.5-2 package. If this is a production cluster, it would be
best to try it on a test cluster first.

http://www.openfabrics.org/downloads/OFED/ofed-3.5-2/OFED-3.5-2-rc1.tgz

-----Original Message-----
From: Orion Poplawski [mailto:orion at cora.nwra.com] 
Sent: Monday, September 23, 2013 12:23 PM
To: Rupert Dance
Cc: Users at lists.openfabrics.org
Subject: Re: [Users] mthca lockup

On 09/23/2013 09:59 AM, Rupert Dance wrote:
> Can you tell us what version of OFED you are running?

Not sure how to tell.  Whatever comes with RHEL/SL 6.4.  Does this help?

libibumad-1.3.8-1.el6.x86_64
libibmad-1.3.9-1.el6.x86_64
libibverbs-1.1.6-5.el6.x86_64
libmthca-1.0.6-3.el6.x86_64


> -----Original Message-----
> From: users-bounces at lists.openfabrics.org
> [mailto:users-bounces at lists.openfabrics.org] On Behalf Of Orion 
> Poplawski
> Sent: Monday, September 23, 2013 11:38 AM
> To: Users at lists.openfabrics.org
> Subject: [Users] mthca lockup
>
> I'm running Scientific Linux 6.4 and just saw the following:
>
> Sep 21 12:44:38 castor kernel: rpcrdma: connection to 192.168.2.16:2050 closed (-103)
> Sep 21 12:45:43 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
> Sep 21 12:46:10 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
> Sep 21 12:46:58 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
> Sep 21 12:47:59 castor kernel: nfs: server saga not responding, timed out
> Sep 21 12:48:03 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
> Sep 21 12:48:23 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
> Sep 21 12:49:08 castor kernel: nfs: server earth.cora.nwra.com not responding, still trying
> Sep 21 12:49:28 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
> Sep 21 12:49:58 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
> Sep 21 12:50:03 castor kernel: Error: state manager failed on NFSv4 server alexandria2ib with error 5
> Sep 21 12:50:43 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
> Sep 21 12:51:33 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
> Sep 21 12:51:48 castor kernel: nfs: server earth.cora.nwra.com not responding, still trying
> Sep 21 12:52:08 castor kernel: ib_mthca 0000:01:00.0: modify QP 3->6 returned -16.
> Sep 21 12:52:11 castor kernel: ------------[ cut here ]------------
> Sep 21 12:52:11 castor kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Tainted: G          I---------------   )
> Sep 21 12:52:11 castor kernel: Hardware name: X7DWT
> Sep 21 12:52:11 castor kernel: NETDEV WATCHDOG: ib0 (ib_mthca): transmit queue 0 timed out
> Sep 21 12:52:11 castor kernel: Modules linked in: des_generic ecb md4 nls_utf8 cifs xprtrdma nfs lockd fscache nfs_acl autofs4 rpcsec_gss_krb5 auth_rpcgss sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa e1000e radeon ttm drm_kms_helper drm i2c_algo_bit ib_mthca ib_mad ib_core microcode serio_raw i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support i5400_edac edac_core i5k_amb ioatdma dca shpchp ext4 jbd2 mbcache sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
> Sep 21 12:52:11 castor kernel: Pid: 0, comm: swapper Tainted: G          I---------------    2.6.32-358.18.1.el6.x86_64 #1
> Sep 21 12:52:11 castor kernel: Call Trace:
> Sep 21 12:52:11 castor kernel: <IRQ>  [<ffffffff8106e3e7>] ? warn_slowpath_common+0x87/0xc0
> Sep 21 12:52:11 castor kernel: [<ffffffff8106e4d6>] ? warn_slowpath_fmt+0x46/0x50
> Sep 21 12:52:11 castor kernel: [<ffffffff81467f8d>] ? dev_watchdog+0x26d/0x280
> Sep 21 12:52:11 castor kernel: [<ffffffff81090e00>] ? work_on_cpu+0xb0/0xd0
> Sep 21 12:52:11 castor kernel: [<ffffffff810913d1>] ? __queue_work+0x41/0x50
> Sep 21 12:52:11 castor kernel: [<ffffffff81467d20>] ? dev_watchdog+0x0/0x280
> Sep 21 12:52:11 castor kernel: [<ffffffff81081937>] ? run_timer_softirq+0x197/0x340
> Sep 21 12:52:11 castor kernel: [<ffffffff810a8060>] ? tick_sched_timer+0x0/0xc0
> Sep 21 12:52:11 castor kernel: [<ffffffff8102ea2d>] ? lapic_next_event+0x1d/0x30
> Sep 21 12:52:11 castor kernel: [<ffffffff810770b1>] ? __do_softirq+0xc1/0x1e0
> Sep 21 12:52:11 castor kernel: [<ffffffff8109b87b>] ? hrtimer_interrupt+0x14b/0x260
> Sep 21 12:52:11 castor kernel: [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
> Sep 21 12:52:11 castor kernel: [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
> Sep 21 12:52:11 castor kernel: [<ffffffff81076e95>] ? irq_exit+0x85/0x90
> Sep 21 12:52:11 castor kernel: [<ffffffff815177d0>] ? smp_apic_timer_interrupt+0x70/0x9b
> Sep 21 12:52:11 castor kernel: [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
> Sep 21 12:52:11 castor kernel: <EOI>  [<ffffffff81307d27>] ? acpi_idle_enter_simple+0x117/0x14b
> Sep 21 12:52:11 castor kernel: [<ffffffff81307d20>] ? acpi_idle_enter_simple+0x110/0x14b
> Sep 21 12:52:11 castor kernel: [<ffffffff81307a2f>] ? acpi_idle_enter_bm+0xef/0x2d0
> Sep 21 12:52:11 castor kernel: [<ffffffff81416718>] ? menu_select+0x178/0x390
> Sep 21 12:52:11 castor kernel: [<ffffffff814155f7>] ? cpuidle_idle_call+0xa7/0x140
> Sep 21 12:52:11 castor kernel: [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
> Sep 21 12:52:11 castor kernel: [<ffffffff8150756c>] ? start_secondary+0x2ac/0x2ef
> Sep 21 12:52:11 castor kernel: ---[ end trace 3f6ffbcc867bdba5 ]---
> Sep 21 12:52:11 castor kernel: ib0: transmit timeout: latency 1997 msecs
> Sep 21 12:52:11 castor kernel: ib0: queue stopped 1, tx_head 2265490, tx_tail 2265362
>
> # ibstat
> CA 'mthca0'
>           CA type: MT25204
>           Number of ports: 1
>           Firmware version: 1.2.0
>           Hardware version: a0
>           Node GUID: 0x0005ad00000c593c
>           System image GUID: 0x0005ad00000c593f
>           Port 1:
>                   State: Active
>                   Physical state: LinkUp
>                   Rate: 20
>                   Base lid: 8
>                   LMC: 0
>                   SM lid: 1
>                   Capability mask: 0x02510a68
>                   Port GUID: 0x0005ad00000c593d
>                   Link layer: InfiniBand
>
> kernel-2.6.32-358.18.1.el6.x86_64
>
> Anyone seen this before?
>
> --
> Orion Poplawski
> Technical Manager                     303-415-9701 x222
> NWRA, Boulder/CoRA Office             FAX: 303-415-9702
> 3380 Mitchell Lane                       orion at nwra.com
> Boulder, CO 80301                   http://www.nwra.com