[ofa-general] Mellanox Gen3, Linux and ibpanic - "Resource Temporarily unavailable"

Robert Dunkley Robert at saq.co.uk
Tue Nov 25 06:54:07 PST 2008


Hi Hal,

Thank you for your help.

Ibstat on MachineB:
CA 'mthca0'
        CA type: MT25204
        Number of ports: 1
        Firmware version: 1.2.0
        Hardware version: a0
        Node GUID: 0x0002c9020022d428
        System image GUID: 0x0002c9020022d42b
        Port 1:
                State: Down
                Physical state: Polling
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x02510a6a
                Port GUID: 0x0002c9020022d429
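
(For anyone who wants to script the "init or down" check rather than read it by eye, here is a minimal sketch. It assumes ibstat prints indented `Key: value` lines in the layout shown above; the function name `port_states` and the trimmed `SAMPLE` text are only illustrative, not part of any OFED tool.)

```python
import re

def port_states(ibstat_text):
    """Parse ibstat-style text into {port_number: {"State": ..., "Physical state": ...}}.

    Assumes the layout shown in the output above: a "Port N:" header
    followed by indented "Key: value" lines.
    """
    ports = {}
    current = None
    for line in ibstat_text.splitlines():
        stripped = line.strip()
        header = re.match(r"Port (\d+):$", stripped)
        if header:
            # Start collecting fields for a new port section.
            current = int(header.group(1))
            ports[current] = {}
            continue
        if current is not None and ":" in stripped:
            key, _, value = stripped.partition(":")
            if key in ("State", "Physical state"):
                ports[current][key] = value.strip()
    return ports

# Trimmed copy of the MachineB output, used only as sample input.
SAMPLE = """CA 'mthca0'
        CA type: MT25204
        Number of ports: 1
        Port 1:
                State: Down
                Physical state: Polling
                Port GUID: 0x0002c9020022d429
"""

if __name__ == "__main__":
    for port, fields in port_states(SAMPLE).items():
        print(port, fields.get("State"), fields.get("Physical state"))
```

In practice the text could come from running ibstat through `subprocess`, but that is left out here so the sketch does not depend on the tool being installed.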

Machine A is operating normally except for InfiniBand, which broke when
I powered down Machine B and did not recover once Machine B was powered
on again. An extract from the log of Machine A:
Nov 25 14:30:21 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
Nov 25 14:30:31 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)
Nov 25 14:30:41 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
Nov 25 14:30:51 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)
Nov 25 14:31:01 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
Nov 25 14:31:11 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_SRQ failed (-11)
Nov 25 14:31:21 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
Nov 25 14:32:01 mrhappy last message repeated 3 times
Nov 25 14:32:11 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
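
(A side note on the "(-11)" in these lines: on Linux, errno 11 is EAGAIN, whose standard message is the same "Resource temporarily unavailable" that ibstat printed on Machine A. The sketch below only shows how to look the number up; the helper name `describe_errno` is made up for illustration, and the result depends on the platform's errno table, so it says nothing about why the driver returned that code.)

```python
import errno
import os

def describe_errno(code):
    """Return (symbolic_name, message) for an errno value.

    Kernel logs usually print the negated value (e.g. -11), so the sign
    is dropped before the lookup.
    """
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)

if __name__ == "__main__":
    # The value taken from the ib_mthca log lines above.
    print(describe_errno(-11))
```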

Thanks again,

Rob

-----Original Message-----
From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] 
Sent: 25 November 2008 14:49
To: Robert Dunkley
Cc: Baur, Eric; general at lists.openfabrics.org
Subject: Re: [ofa-general] Mellanox Gen3, Linux and ibpanic - "Resource
Temporarily unavailable"

On Tue, Nov 25, 2008 at 9:39 AM, Robert Dunkley <Robert at saq.co.uk>
wrote:
> Hi Eric,
>
> Thanks for the response. OpenSM is running and set to start on bootup
> on MachineB:
> ps aux | grep open
> root      5616  0.0  0.1 142004  1396 ?        Sl   13:39   0:00
> /usr/sbin/opensm -t 200 -f /var/log/opensm.log -g 0
>
> The log on Machine B just logs this every 10 seconds:
> Nov 25 14:34:21 148541 [477A7940] 0x01 ->
> __osm_sm_state_mgr_signal_error: ERR 3207: Invalid signal
> OSM_SM_SIGNAL_DISCOVER in state IB_SMINFO_STATE_DISCOVERING
> Nov 25 14:34:31 153173 [477A7940] 0x80 -> SM port is down
>
> Ibstat confirms port is in polling state on MachineB.

Is the port in init or down?

> MachineA however is in a bad state,

Any additional details on this?

Can you kill/unload all the IB stuff and reload it? That would be
gentler than rebooting.

-- Hal

> I tried the openibd restart command; it accepted the command, but
> after 5 minutes it shows no sign of progress and is just sitting at
> the cursor. Is some sort of forced restart of openibd possible?
>
> Thanks,
>
> Rob
>
>
> -----Original Message-----
> From: Baur, Eric [mailto:Eric.Baur at gs.com]
> Sent: 25 November 2008 14:31
> To: Robert Dunkley
> Subject: RE: [ofa-general] Mellanox Gen3,Linux and ibpanic - "Resource
> Temporarily unavailable"
>
> Robert-
>
> Is OpenSM set to start on boot?
>                chkconfig --list | grep opensmd
>
> If not:         chkconfig opensmd on
> and:            /etc/init.d/opensmd start
>
> You can also restart openib without rebooting the machines.
>                /etc/init.d/openibd restart
>
> -Eric
>
> -----Original Message-----
> From: general-bounces at lists.openfabrics.org
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Robert
> Dunkley
> Sent: Tuesday, November 25, 2008 9:21 AM
> To: general at lists.openfabrics.org
> Subject: [ofa-general] Mellanox Gen3,Linux and ibpanic - "Resource
> Temporarily unavailable"
>
> Hi everyone,
>
> I'm using a setup of two machines (let's call them A and B) directly
> connected by one cable. Each machine has a Mellanox MT25204 (Gen3
> Mellanox PCI-E InfiniBand card) and uses IPoIB. Both run CentOS 5.2
> with OFED 1.3 installed, and Machine B runs OpenSM.
>
> All was working fine. I shut down Machine A, did some maintenance and
> then powered it on again, and everything was OK. I then shut down
> Machine B (the one running OpenSM), which seemed to really upset
> Machine A. After booting Machine B again, Machine B looks OK, with the
> port down and in polling state. Machine A, however, gives the
> following error if I run ibstat:
> ibpanic: [11406] main: stat of IB device 'mthca0' failed:
> (Resource temporarily unavailable)
>
> I don't want to reboot Machine A, as it must sync data with Machine B
> over the InfiniBand link first. Does anyone have any idea how to fix
> Machine A?
>
> Thanks,
>
> Rob
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general


