Re: [ofa-general] Mellanox Gen3, Linux and ibpanic - "Resource Temporarily unavailable"

Ira Weiny weiny2 at llnl.gov
Tue Nov 25 10:59:37 PST 2008


On Tue, 25 Nov 2008 10:00:22 -0500
"Hal Rosenstock" <hal.rosenstock at gmail.com> wrote:

> Hi Rob,
> 
> On Tue, Nov 25, 2008 at 9:54 AM, Robert Dunkley <Robert at saq.co.uk> wrote:
> > Hi Hal,
> >
> > Thank you for your help.
> >
> > Ibstat on MachineB:
> > CA 'mthca0'
> >        CA type: MT25204
> >        Number of ports: 1
> >        Firmware version: 1.2.0
> >        Hardware version: a0
> >        Node GUID: 0x0002c9020022d428
> >        System image GUID: 0x0002c9020022d42b
> >        Port 1:
> >                State: Down
> 
> Is machine A on ? Is mthca loaded there ? If so, this should at least
> be init but the driver errors below may preclude this from occurring.
> 
> >                Physical state: Polling
> >                Rate: 10
> >                Base lid: 0
> >                LMC: 0
> >                SM lid: 0
> >                Capability mask: 0x02510a6a
> >                Port GUID: 0x0002c9020022d429
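
For anyone who wants to script this check rather than read ibstat by eye, here is a minimal sketch that pulls the per-port fields out of text shaped like the output quoted above. The function names are hypothetical, and treating "Active" plus "LinkUp" as the healthy combination is my assumption; the parser only relies on the "Port N:" and "Key: value" layout shown in this thread.

```python
def parse_ibstat_ports(text):
    """Return {port_number: {field: value}} from ibstat-style text."""
    ports = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("CA '"):
            # A new adapter section starts; stop attributing fields to a port.
            current = None
        elif line.startswith("Port ") and line.endswith(":"):
            # "Port 1:" opens a port section ("Port GUID: ..." does not end in ':').
            current = int(line[5:-1])
            ports[current] = {}
        elif current is not None and ":" in line:
            key, value = line.split(":", 1)
            ports[current][key.strip()] = value.strip()
    return ports


def port_is_usable(fields):
    """Assumption: a usable port reports State 'Active' and Physical state 'LinkUp'."""
    return (fields.get("State") == "Active"
            and fields.get("Physical state") == "LinkUp")
```

Fed the MachineB output above, this reports port 1 as State "Down" with Physical state "Polling", so `port_is_usable` returns False for it.
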
> >
> > Machine A is operating normally with the exception of Infiniband which
> > broke after powering down Machine B and did not recover once Machine B
> > was powered on again. An extract from the log of Machine A:
> > Nov 25 14:30:21 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> > Nov 25 14:30:31 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)
> > Nov 25 14:30:41 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> > Nov 25 14:30:51 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_CQ failed (-11)
> > Nov 25 14:31:01 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> > Nov 25 14:31:11 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_SRQ failed (-11)
> > Nov 25 14:31:21 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> > Nov 25 14:32:01 mrhappy last message repeated 3 times
> > Nov 25 14:32:11 mrhappy kernel: ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)
> 
> -11 is EAGAIN. Not sure what this is used for in the mthca driver.
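
To make the errno mapping concrete, here is a minimal sketch that decodes the negative status in these log lines and counts failures per firmware command. It assumes a Linux host, where errno 11 is EAGAIN and its message should be the same "Resource temporarily unavailable" text that ibstat printed. The regular expression and function names are my own, written against the log format quoted above; they are not taken from the mthca driver.

```python
import errno
import os
import re
from collections import Counter

# Matches lines such as:
#   "ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)"
# \s+ also tolerates the status being wrapped onto the next line.
FAILED_CMD = re.compile(r"ib_mthca \S+: (\w+) failed\s+\((-?\d+)\)")


def decode_status(status):
    """Map a negative kernel return value to (errno name, message)."""
    code = abs(status)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


def summarize(log_text):
    """Count failures per (command, status) pair found in the log text."""
    counts = Counter()
    for cmd, status in FAILED_CMD.findall(log_text):
        counts[(cmd, int(status))] += 1
    return counts


if __name__ == "__main__":
    line = ("Nov 25 14:30:21 mrhappy kernel: "
            "ib_mthca 0000:87:00.0: HW2SW_MPT failed (-11)")
    for (cmd, status), n in summarize(line).items():
        print(cmd, status, decode_status(status), n)
```

Every failure in the extract above carries the same status, which fits firmware that has stopped answering commands rather than one specific command being rejected.
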

When we have seen these errors, it has meant the firmware is in a bad state and
is not responsive.  Unfortunately for you, in this situation we have been
forced to reboot to correct the problem.  (If rebooting is problematic for you,
perhaps Mellanox has a way around this.)

For the future, speak with Mellanox to ensure you have the latest firmware;
newer firmware has fixed a number of issues for us.

Ira

> 
> Can you unload and reload the IB stack especially mthca driver ?
> 
> -- Hal
> 
> > Thanks again,
> >
> > Rob
> >
> > -----Original Message-----
> > From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
> > Sent: 25 November 2008 14:49
> > To: Robert Dunkley
> > Cc: Baur, Eric; general at lists.openfabrics.org
> > Subject: Re: [ofa-general] Mellanox Gen3, Linux and ibpanic - "Resource
> > Temporarily unavailable"
> >
> > On Tue, Nov 25, 2008 at 9:39 AM, Robert Dunkley <Robert at saq.co.uk> wrote:
> >> Hi Eric,
> >>
> >> Thanks for the response. OpenSM is running and set to start on bootup on
> >> MachineB:
> >> ps aux | grep open
> >> root      5616  0.0  0.1 142004  1396 ?        Sl   13:39   0:00 /usr/sbin/opensm -t 200 -f /var/log/opensm.log -g 0
> >>
> >> The log on Machine B just logs this every 10 seconds:
> >> Nov 25 14:34:21 148541 [477A7940] 0x01 -> __osm_sm_state_mgr_signal_error: ERR 3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state IB_SMINFO_STATE_DISCOVERING
> >> Nov 25 14:34:31 153173 [477A7940] 0x80 -> SM port is down
> >>
> >> Ibstat confirms port is in polling state on MachineB.
> >
> > Is the port in init or down ?
> >
> >> MachineA however is in a bad state,
> >
> > Any additional details on this ?
> >
> > Can you kill/unload all the ib stuff and reload it ? That would be
> > gentler than rebooting.
> >
> > -- Hal
> >
> >> I tried the openibd restart command; it accepted the command, but
> >> after 5 minutes it shows no sign of progress and is just sitting at
> >> the cursor. Is some sort of forced restart of openibd possible?
> >>
> >> Thanks,
> >>
> >> Rob
> >>
> >>
> >> -----Original Message-----
> >> From: Baur, Eric [mailto:Eric.Baur at gs.com]
> >> Sent: 25 November 2008 14:31
> >> To: Robert Dunkley
> >> Subject: RE: [ofa-general] Mellanox Gen3,Linux and ibpanic - "Resource
> >> Temporarily unavailable"
> >>
> >> Robert-
> >>
> >> Is OpenSM set to start on boot?
> >>                chkconfig --list | grep opensmd
> >>
> >> If not:         chkconfig opensmd on
> >> and:            /etc/init.d/opensmd start
> >>
> >> You can also restart openib without rebooting the machines.
> >>                /etc/init.d/openibd restart
> >>
> >> -Eric
> >>
> >> -----Original Message-----
> >> From: general-bounces at lists.openfabrics.org
> >> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Robert
> >> Dunkley
> >> Sent: Tuesday, November 25, 2008 9:21 AM
> >> To: general at lists.openfabrics.org
> >> Subject: [ofa-general] Mellanox Gen3,Linux and ibpanic - "Resource
> >> Temporarily unavailable"
> >>
> >> Hi everyone,
> >>
> >> I'm using a setup of two machines (let's call them A and B) directly
> >> connected by one cable. Each machine has a Mellanox MT25204 (Gen3
> >> Mellanox PCI-E InfiniBand card) and uses IPoIB. They run CentOS 5.2
> >> with OFED 1.3 installed, and Machine B runs OpenSM.
> >>
> >> All was working fine. I shut down Machine A, did some maintenance, and
> >> then powered it on again; everything was OK. I then shut down Machine B
> >> (the one running OpenSM), and this seemed to really upset Machine A.
> >> After booting Machine B again, Machine B looks OK, with the port down
> >> and in polling state. Machine A, however, gives the following error if
> >> I run ibstat:
> >> ibpanic: [11406] main: stat of IB device 'mthca0' failed: (Resource temporarily unavailable)
> >>
> >> I don't want to reboot Machine A as it must synch data with Machine B
> >> over the Infiniband link first. Does anyone have any idea how to fix
> >> machine A?
> >>
> >> Thanks,
> >>
> >> Rob
> >>
> >> The SAQ Group
> >>
> >> Registered Office: 18 Chapel Street, Petersfield, Hampshire GU32 3DZ
> >> SEMTEC Limited Trading as SAQ is Registered in England & Wales
> >> Company Number: 06481952
> >>
> >>
> >>
> >> http:// www. saqnet.co.uk AS29219
> >>
> >> SAQ Group Delivers high quality, honestly priced communication and I.T.
> >> services to UK Business.
> >>
> >> DSL : Domains : Email : Hosting : CoLo : Servers : Racks : Transit :
> >> Backups : Managed Networks : Remote Support.
> >>
> >> Find us in http:// www. thebestof.co.uk/petersfield
> >>
> >> _______________________________________________
> >> general mailing list
> >> general at lists.openfabrics.org
> >> http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >>
> >> To unsubscribe, please visit
> >> http:// openib.org/mailman/listinfo/openib-general
> >>
> >


