Re: [ofa-general] Mellanox Gen3, Linux and ibpanic - "Resource Temporarily unavailable"
Hal Rosenstock
hal.rosenstock at gmail.com
Tue Nov 25 07:00:22 PST 2008
Hi Rob,
On Tue, Nov 25, 2008 at 9:54 AM, Robert Dunkley <Robert at saq.co.uk> wrote:
> Hi Hal,
>
> Thank you for your help.
>
> Ibstat on MachineB:
> CA 'mthca0'
> CA type: MT25204
> Number of ports: 1
> Firmware version: 1.2.0
> Hardware version: a0
> Node GUID: 0x0002c9020022d428
> System image GUID: 0x0002c9020022d42b
> Port 1:
> State: Down
Is Machine A on? Is mthca loaded there? If so, this port should at least
be in init, but the driver errors below may prevent that from happening.
> Physical state: Polling
> Rate: 10
> Base lid: 0
> LMC: 0
> SM lid: 0
> Capability mask: 0x02510a6a
> Port GUID: 0x0002c9020022d429
>
> Machine A is operating normally with the exception of Infiniband which
> broke after powering down Machine B and did not recover once Machine B
> was powered on again. An extract from the log of Machine A:
> Nov 25 14:30:21 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> Nov 25 14:30:31 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)
> Nov 25 14:30:41 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> Nov 25 14:30:51 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)
> Nov 25 14:31:01 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> Nov 25 14:31:11 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_SRQ failed (-11)
> Nov 25 14:31:21 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> Nov 25 14:32:01 mrhappy last message repeated 3 times
> Nov 25 14:32:11 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
-11 is EAGAIN. I'm not sure what the mthca driver uses it for.
Can you unload and reload the IB stack, especially the mthca driver?
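
For anyone reading these logs later, here is a minimal Python sketch of how
the status codes above can be decoded and tallied. It only shows the generic
errno mapping and says nothing about why mthca returns this value. The names
decode_status, tally_failures, FAILURE_RE and SAMPLE are made up for
illustration, SAMPLE reuses three lines from the log extract above, and the
regular expression assumes the message layout seen in that extract.

```python
import errno
import os
import re
from collections import Counter

# Matches mthca messages such as
# "ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)" (layout assumed from the log above).
FAILURE_RE = re.compile(r"ib_mthca \S+: (\w+) failed \((-?\d+)\)")


def decode_status(status):
    """Map a negative kernel-style status to (errno name, message)."""
    code = abs(status)
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


def tally_failures(log_text):
    """Count (command, status) pairs found in a syslog extract."""
    return Counter(
        (cmd, int(status)) for cmd, status in FAILURE_RE.findall(log_text)
    )


# Three lines taken from the log extract quoted in this thread.
SAMPLE = """\
Nov 25 14:30:21 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
Nov 25 14:30:31 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)
Nov 25 14:31:11 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_SRQ failed (-11)
"""

if __name__ == "__main__":
    for (cmd, status), count in sorted(tally_failures(SAMPLE).items()):
        name, message = decode_status(status)
        print(f"{cmd}: {count} x {status} ({name}: {message})")
```

The numeric value of EAGAIN is platform-specific (11 on Linux), so the sketch
should be run on the Linux host that produced the log if the names are to
match.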
-- Hal
> Thanks again,
>
> Rob
>
> -----Original Message-----
> From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
> Sent: 25 November 2008 14:49
> To: Robert Dunkley
> Cc: Baur, Eric; general at lists.openfabrics.org
> Subject: Re: [ofa-general] Mellanox Gen3, Linux and ibpanic - "Resource
> Temporarily unavailable"
>
> On Tue, Nov 25, 2008 at 9:39 AM, Robert Dunkley <Robert at saq.co.uk> wrote:
>> Hi Eric,
>>
>> Thanks for the response. OpenSM is running and set to start on bootup on
>> MachineB:
>> ps aux | grep open
>> root 5616 0.0 0.1 142004 1396 ? Sl 13:39 0:00 /usr/sbin/opensm -t 200 -f /var/log/opensm.log -g 0
>>
>> The log on Machine B just logs this every 10 seconds:
>> Nov 25 14:34:21 148541 [477A7940] 0x01 -> __osm_sm_state_mgr_signal_error: ERR 3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state IB_SMINFO_STATE_DISCOVERING
>> Nov 25 14:34:31 153173 [477A7940] 0x80 -> SM port is down
>>
>> Ibstat confirms port is in polling state on MachineB.
>
> Is the port in init or down?
>
>> MachineA however is in a bad state,
>
> Any additional details on this?
>
> Can you kill/unload all the IB stuff and reload it? That would be
> gentler than rebooting.
>
> -- Hal
>
>> I tried the openibd restart command; it accepted the command, but after
>> 5 minutes it shows no sign of progress and is just sitting at the cursor.
>> Is some sort of forced restart of openibd possible?
>>
>> Thanks,
>>
>> Rob
>>
>>
>> -----Original Message-----
>> From: Baur, Eric [mailto:Eric.Baur at gs.com]
>> Sent: 25 November 2008 14:31
>> To: Robert Dunkley
>> Subject: RE: [ofa-general] Mellanox Gen3,Linux and ibpanic - "Resource
>> Temporarily unavailable"
>>
>> Robert-
>>
>> Is OpenSM set to start on boot?
>> chkconfig --list | grep opensmd
>>
>> If not: chkconfig opensmd on
>> and: /etc/init.d/opensmd start
>>
>> You can also restart openib without rebooting the machines.
>> /etc/init.d/openibd restart
>>
>> -Eric
>>
>> -----Original Message-----
>> From: general-bounces at lists.openfabrics.org
>> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Robert
>> Dunkley
>> Sent: Tuesday, November 25, 2008 9:21 AM
>> To: general at lists.openfabrics.org
>> Subject: [ofa-general] Mellanox Gen3,Linux and ibpanic - "Resource
>> Temporarily unavailable"
>>
>> Hi everyone,
>>
>> I'm using a setup of two machines (let's call them A and B) directly
>> connected by one cable. Each machine has a Mellanox MT25204 (Gen3 Mellanox
>> PCI-E InfiniBand card) and uses IPoIB. They run CentOS 5.2 with OFED 1.3
>> installed, and Machine B runs OpenSM.
>>
>> All was working fine. I shut down Machine A, did some maintenance, and then
>> powered it on again; everything was OK. I then shut down Machine B
>> (the one running OpenSM), and this seemed to really upset Machine A. After
>> booting Machine B again, Machine B looks OK, with the port down and in
>> polling state. Machine A, however, gives the following error if I run
>> ibstat:
>> ibpanic: [11406] main: stat of IB device 'mthca0' failed: (Resource temporarily unavailable)
>>
>> I don't want to reboot Machine A as it must synch data with Machine B
>> over the Infiniband link first. Does anyone have any idea how to fix
>> machine A?
>>
>> Thanks,
>>
>> Rob
>>
>> _______________________________________________
>> general mailing list
>> general at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>
>> To unsubscribe, please visit
>> http://openib.org/mailman/listinfo/openib-general