Re: [ofa-general] Mellanox Gen3, Linux and ibpanic - "Resource Temporarily unavailable"

Hal Rosenstock hal.rosenstock at gmail.com
Tue Nov 25 07:30:39 PST 2008


Hi Rob,

On Tue, Nov 25, 2008 at 10:21 AM, Robert Dunkley <Robert at saq.co.uk> wrote:
> Hi Hal,
>
> Thanks again, I will try this in a minute. I think I have found the
> moment it went bad on Machine A using dmesg:
> ib_mthca 0000:87:00.0: Catastrophic error detected: unknown error

Definitely need to reset mthca after this.

> ib_mthca 0000:87:00.0:   buf[00]: ffffffff
> ib_mthca 0000:87:00.0:   buf[01]: ffffffff
> ib_mthca 0000:87:00.0:   buf[02]: ffffffff
> ib_mthca 0000:87:00.0:   buf[03]: ffffffff
> ib_mthca 0000:87:00.0:   buf[04]: ffffffff
> ib_mthca 0000:87:00.0:   buf[05]: ffffffff
> ib_mthca 0000:87:00.0:   buf[06]: ffffffff
> ib_mthca 0000:87:00.0:   buf[07]: ffffffff
> ib_mthca 0000:87:00.0:   buf[08]: ffffffff
> ib_mthca 0000:87:00.0:   buf[09]: ffffffff
> ib_mthca 0000:87:00.0:   buf[0a]: ffffffff
> ib_mthca 0000:87:00.0:   buf[0b]: ffffffff
> ib_mthca 0000:87:00.0:   buf[0c]: ffffffff
> ib_mthca 0000:87:00.0:   buf[0d]: ffffffff
> ib_mthca 0000:87:00.0:   buf[0e]: ffffffff
> ib_mthca 0000:87:00.0:   buf[0f]: ffffffff
> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> ib0: ib_query_gid() failed
> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> ib0: ib_query_port failed
> ib0: Failed to modify QP to ERROR state
> ib0: timing out; 1 sends 250 receives not completed
> ib0: Failed to modify QP to RESET state
> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)
> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)
> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> ib_mthca 0000:87:00.0: HW2SW_SRQ failed (-11)
> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>
> Does this help to pinpoint what might have caused this?

Maybe Mellanox can comment. What firmware version are you using?
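
For reference, the "(-11)" on the HW2SW_* lines above is a negated errno
value, and errno 11 is EAGAIN on Linux. Its standard message is the same
"Resource temporarily unavailable" text that ibstat printed. Below is a
minimal Python sketch for decoding such lines. The function name and the
regular expression are illustrative assumptions based only on the log
lines quoted in this thread; they are not part of any OFED tool.

```python
import errno
import os
import re

# Matches the "<COMMAND> failed (-N)" part of an ib_mthca log line.
# The pattern is an assumption drawn from the lines quoted in this thread.
FAILED_RE = re.compile(r"(\w+) failed \((-?\d+)\)")


def decode_failure(line):
    """Return (command, errno name, message) for an 'X failed (-N)' line, else None."""
    match = FAILED_RE.search(line)
    if match is None:
        return None
    command = match.group(1)
    code = abs(int(match.group(2)))  # the kernel logs the errno negated
    # errno.errorcode maps numbers to symbolic names on the running platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    return command, name, os.strerror(code)


if __name__ == "__main__":
    sample = "ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)"
    print(decode_failure(sample))
```

Lines without a numeric code, such as "ib0: ib_query_port failed", are
skipped (the function returns None). The errno numbering is platform
specific, so the script should be run on the Linux host that produced the
log.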

-- Hal

>
> Thanks,
>
> Rob
>
>
> -----Original Message-----
> From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
> Sent: 25 November 2008 15:19
> To: Robert Dunkley
> Subject: Re: [ofa-general] Mellanox Gen3, Linux and ibpanic - "Resource
> Temporarily unavailable"
>
> Hi Rob,
>
> On Tue, Nov 25, 2008 at 10:01 AM, Robert Dunkley <Robert at saq.co.uk>
> wrote:
>> Hi Hal,
>>
>> Machine A is definitely on and I have had the cable connection checked.
>> I'm afraid I'm not much of a techy; how do I unload and reload the IB
>> stack?
>
> It depends on what you have running... Is it just OpenSM and IPoIB ?
>
> Kill off opensm
>
> Use modprobe -r to remove all the ib_ modules. You can find them via
> lsmod | grep ib_. There is a dependency order.
>
> If you can get them all unloaded, reload them in the reverse order and
> hopefully things will be better...
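
The dependency order mentioned above can be worked out from the "Used by"
column of lsmod: a module can only be removed once nothing else is using
it. Below is a minimal Python sketch of that ordering. It only prints the
modprobe commands and does not run them. The sample lsmod table is
hypothetical (module list, sizes and use counts are made up for
illustration), and the function names are my own; the real output on a
given machine will differ.

```python
# Hypothetical lsmod output; sizes and use counts are illustrative only.
SAMPLE_LSMOD = """\
Module                  Size  Used by
ib_ipoib               70000  0
ib_cm                  40000  1 ib_ipoib
ib_sa                  40000  2 ib_ipoib,ib_cm
ib_mthca              130000  0
ib_mad                 40000  3 ib_cm,ib_sa,ib_mthca
ib_core                60000  5 ib_ipoib,ib_cm,ib_sa,ib_mthca,ib_mad
"""


def parse_lsmod(text):
    """Map each module name to the set of modules in its 'Used by' column."""
    users = {}
    for line in text.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue
        used_by = set(fields[3].split(",")) if len(fields) > 3 else set()
        used_by.discard("")  # tolerate a trailing comma
        users[fields[0]] = used_by
    return users


def unload_order(users, prefix="ib_"):
    """Order the prefixed modules (plus anything using them) so users go first."""
    pending = {m for m in users if m.startswith(prefix)}
    # Pull in loaded modules that use a pending module, even without the prefix.
    changed = True
    while changed:
        changed = False
        for module in list(pending):
            extra = {u for u in users[module] if u in users} - pending
            if extra:
                pending |= extra
                changed = True
    order = []
    while pending:
        # A module is ready once none of its users are still loaded.
        ready = sorted(m for m in pending if not (users[m] & pending))
        if not ready:
            raise ValueError("circular module usage; cannot order")
        order.extend(ready)
        pending -= set(ready)
    return order


if __name__ == "__main__":
    order = unload_order(parse_lsmod(SAMPLE_LSMOD))
    for module in order:
        print("modprobe -r " + module)
    for module in reversed(order):
        print("modprobe " + module)
```

To try it against a live system, the text passed to parse_lsmod would be
the captured output of lsmod instead of the sample table. Reloading is the
same list reversed, as described above.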
>
> -- Hal
>
>> Thanks,
>>
>> Rob
>>
>>
>> -----Original Message-----
>> From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
>> Sent: 25 November 2008 15:00
>> To: Robert Dunkley
>> Cc: Baur, Eric; general at lists.openfabrics.org
>> Subject: Re: [ofa-general] Mellanox Gen3, Linux and ibpanic -
>> "Resource Temporarily unavailable"
>>
>> Hi Rob,
>>
>> On Tue, Nov 25, 2008 at 9:54 AM, Robert Dunkley <Robert at saq.co.uk>
>> wrote:
>>> Hi Hal,
>>>
>>> Thank you for your help.
>>>
>>> Ibstat on MachineB:
>>> CA 'mthca0'
>>>        CA type: MT25204
>>>        Number of ports: 1
>>>        Firmware version: 1.2.0
>>>        Hardware version: a0
>>>        Node GUID: 0x0002c9020022d428
>>>        System image GUID: 0x0002c9020022d42b
>>>        Port 1:
>>>                State: Down
>>
>> Is Machine A on? Is mthca loaded there? If so, the port should at least
>> be in the Init state, but the driver errors below may prevent that.
>>
>>>                Physical state: Polling
>>>                Rate: 10
>>>                Base lid: 0
>>>                LMC: 0
>>>                SM lid: 0
>>>                Capability mask: 0x02510a6a
>>>                Port GUID: 0x0002c9020022d429
>>>
>>> Machine A is operating normally with the exception of InfiniBand, which
>>> broke after powering down Machine B and did not recover once Machine B
>>> was powered on again. An extract from the log of Machine A:
>>> Nov 25 14:30:21 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>>> Nov 25 14:30:31 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)
>>> Nov 25 14:30:41 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>>> Nov 25 14:30:51 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)
>>> Nov 25 14:31:01 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>>> Nov 25 14:31:11 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_SRQ failed (-11)
>>> Nov 25 14:31:21 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>>> Nov 25 14:32:01 mrhappy last message repeated 3 times
>>> Nov 25 14:32:11 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
>>
>> -11 is EAGAIN. Not sure what this is used for in the mthca driver.
>>
>> Can you unload and reload the IB stack, especially the mthca driver?
>>
>> -- Hal
>>
>>> Thanks again,
>>>
>>> Rob
>>>
>>> -----Original Message-----
>>> From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
>>> Sent: 25 November 2008 14:49
>>> To: Robert Dunkley
>>> Cc: Baur, Eric; general at lists.openfabrics.org
>>> Subject: Re: [ofa-general] Mellanox Gen3, Linux and ibpanic -
>>> "Resource Temporarily unavailable"
>>>
>>> On Tue, Nov 25, 2008 at 9:39 AM, Robert Dunkley <Robert at saq.co.uk>
>>> wrote:
>>>> Hi Eric,
>>>>
>>>> Thanks for the response. OpenSM is running and set to start on bootup on
>>>> MachineB:
>>>> ps aux | grep open
>>>> root      5616  0.0  0.1 142004  1396 ?        Sl   13:39   0:00 /usr/sbin/opensm -t 200 -f /var/log/opensm.log -g 0
>>>>
>>>> The log on Machine B just logs this every 10 seconds:
>>>> Nov 25 14:34:21 148541 [477A7940] 0x01 ->
>>>> __osm_sm_state_mgr_signal_error: ERR 3207: Invalid signal
>>>> OSM_SM_SIGNAL_DISCOVER in state IB_SMINFO_STATE_DISCOVERING
>>>> Nov 25 14:34:31 153173 [477A7940] 0x80 -> SM port is down
>>>>
>>>> Ibstat confirms port is in polling state on MachineB.
>>>
>>> Is the port in init or down ?
>>>
>>>> MachineA however is in a bad state,
>>>
>>> Any additional details on this ?
>>>
>>> Can you kill/unload all the ib stuff and reload it ? That would be
>>> gentler than rebooting.
>>>
>>> -- Hal
>>>
>>>> I tried the openibd restart command. It accepted the command, but
>>>> after 5 minutes it shows no sign of progress and is just sitting at
>>>> the cursor. Is some sort of forced restart of openibd possible?
>>>>
>>>> Thanks,
>>>>
>>>> Rob
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Baur, Eric [mailto:Eric.Baur at gs.com]
>>>> Sent: 25 November 2008 14:31
>>>> To: Robert Dunkley
>>>> Subject: RE: [ofa-general] Mellanox Gen3,Linux and ibpanic -
>>>> "Resource Temporarily unavailable"
>>>>
>>>> Robert-
>>>>
>>>> Is OpenSM set to start on boot?
>>>>                chkconfig --list | grep opensmd
>>>>
>>>> If not:         chkconfig opensmd on
>>>> and:            /etc/init.d/opensmd start
>>>>
>>>> You can also restart openib without rebooting the machines.
>>>>                /etc/init.d/openibd restart
>>>>
>>>> -Eric
>>>>
>>>> -----Original Message-----
>>>> From: general-bounces at lists.openfabrics.org
>>>> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Robert
>>>> Dunkley
>>>> Sent: Tuesday, November 25, 2008 9:21 AM
>>>> To: general at lists.openfabrics.org
>>>> Subject: [ofa-general] Mellanox Gen3,Linux and ibpanic - "Resource
>>>> Temporarily unavailable"
>>>>
>>>> Hi everyone,
>>>>
>>>> I'm using a setup of two machines (let's call them A and B) directly
>>>> connected by one cable. Each machine has a Mellanox MT25204 (Gen3
>>>> Mellanox PCI-E InfiniBand card) and uses IPoIB. They run CentOS 5.2
>>>> with OFED 1.3 installed, and Machine B runs OpenSM.
>>>>
>>>> All was working fine. I shut down Machine A, did some maintenance, and
>>>> then powered it on again; everything was OK. I then shut down Machine B
>>>> (the one running OpenSM), which seemed to really upset Machine A. After
>>>> booting Machine B again, Machine B looks OK, with the port down and in
>>>> polling state. Machine A, however, gives the following error if I run
>>>> ibstat: ibpanic: [11406] main: stat of IB device 'mthca0' failed:
>>>> (Resource temporarily unavailable)
>>>>
>>>> I don't want to reboot Machine A, as it must sync data with Machine B
>>>> over the InfiniBand link first. Does anyone have any idea how to fix
>>>> Machine A?
>>>>
>>>> Thanks,
>>>>
>>>> Rob
>>>>
>>>> The SAQ Group
>>>>
>>>> Registered Office: 18 Chapel Street, Petersfield, Hampshire GU32 3DZ
>>>> SEMTEC Limited Trading as SAQ is Registered in England & Wales
>>>> Company Number: 06481952
>>>>
>>>>
>>>>
>>>> http://www.saqnet.co.uk AS29219
>>>>
>>>> SAQ Group Delivers high quality, honestly priced communication and
>>> I.T.
>>>> services to UK Business.
>>>>
>>>> DSL : Domains : Email : Hosting : CoLo : Servers : Racks : Transit :
>>>> Backups : Managed Networks : Remote Support.
>>>>
>>>> Find us in http://www.thebestof.co.uk/petersfield
>>>>
>>>> _______________________________________________
>>>> general mailing list
>>>> general at lists.openfabrics.org
>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>>>
>>>> To unsubscribe, please visit
>>>> http://openib.org/mailman/listinfo/openib-general
>>>>
>>>
>>
>


