[ofa-general] IB interfaces occasionally go down & come up for no reason

Ira Weiny weiny2 at llnl.gov
Thu Dec 18 09:20:00 PST 2008


Sumeet,

On Thu, 18 Dec 2008 08:54:04 -0800
Sumeet Lahorani <Sumeet.Lahorani at oracle.com> wrote:

> 
> We are using the SM on the voltaire switch.
> 
> I could collect before & after snapshots of the port counters if I had a 
> way of knowing when the event was about to happen. The problem is I 
> don't. I guess we could run ibqueryerrors.pl every 5 seconds or so and 
> correlate this event based on the timestamp.

You will have to do something like that.  I don't know whether the Voltaire SM
has any performance management you can run; check with them.  The last time we
ran the Voltaire SM, we used the clear, run, and read procedure you describe.
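
A rough sketch of that timestamp correlation in Python.  It assumes you already
have periodic counter snapshots as (timestamp, counters) pairs; how you collect
them (for example by parsing ibqueryerrors.pl output) is left out, and the
counter names, values, and helper names below are made up for illustration:

```python
from datetime import datetime, timedelta


def counter_deltas(before, after):
    """Return only the counters whose value increased between two snapshots."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] > before.get(name, 0)
    }


def snapshots_around(snapshots, event_time):
    """Return the two consecutive snapshots that bracket event_time, or None.

    snapshots is a list of (timestamp, counters) tuples sorted by timestamp.
    """
    for (t0, c0), (t1, c1) in zip(snapshots, snapshots[1:]):
        if t0 <= event_time <= t1:
            return (t0, c0), (t1, c1)
    return None


# Hypothetical snapshots taken every 5 seconds (values are invented).
start = datetime(2008, 12, 17, 14, 47, 15)
snapshots = [
    (start, {"SymbolErrors": 0, "XmitDiscards": 3}),
    (start + timedelta(seconds=5), {"SymbolErrors": 0, "XmitDiscards": 3}),
    (start + timedelta(seconds=10), {"SymbolErrors": 120, "XmitDiscards": 45}),
]

# Time of the "link status down" message taken from syslog.
event = datetime(2008, 12, 17, 14, 47, 23)

(t0, before), (t1, after) = snapshots_around(snapshots, event)
print(t0.time(), "->", t1.time(), counter_deltas(before, after))
```

The shorter the polling interval, the tighter the window you can attribute the
counter increase to.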

Alternatively, you could try OpenSM with the Performance Manager.  Here you
would have to read the errors periodically and look for trends.  I am just now
getting back to a new version of my plugin, which will not only store the data
in MySQL but also keep a history of each read performed, to get better
historical data.

> 
> Is there some tracing I could turn on to dump out the reason for the 
> link bounce?

I am not sure the link bounced.  Perhaps you are just getting errors on the
link that cause the ULPs to give up.  For example, RC QPs could be going into
the error state after a number of failed packets.  I think that is why Hal
wanted you to look for errors.

> 
> Do you have some examples of the errors that can lead to such a link bounce?

If the link does "bounce", i.e. physically goes down while data is flowing over
it, look for the Symbol Errors and Xmit Discards to be "pegged" (stuck at their
maximum values).  We see this when a cable is pulled accidentally or a node
goes unresponsive in a running job.  It will probably be easier to see these
errors on the switch port.
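
A minimal sketch of a "pegged" check in Python.  It assumes both counters are
16 bits wide and stick at their maximum instead of wrapping, as I understand
the PortCounters attribute to define them; verify the widths against your own
tools.  The function name and the sample values are invented:

```python
# Assumed widths: 16 bits for both counters (an assumption, see above).
COUNTER_MAX = {
    "SymbolErrorCounter": 0xFFFF,
    "PortXmitDiscards": 0xFFFF,
}


def pegged_counters(counters, limits=COUNTER_MAX):
    """Return the sorted names of counters that have reached their maximum."""
    return sorted(
        name for name, limit in limits.items()
        if counters.get(name, 0) >= limit
    )


# Hypothetical readings from one switch port.
sample = {
    "SymbolErrorCounter": 65535,
    "PortXmitDiscards": 65535,
    "LinkDownedCounter": 1,
}
print(pegged_counters(sample))
```

A pegged counter only tells you that at least that many events occurred, so it
has to be cleared before it can show anything new.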

Hope this helps,
Ira

> 
> - Sumeet
> 
> Hal Rosenstock wrote:
> > Hi,
> >
> > On Thu, Dec 18, 2008 at 3:28 AM, Sumeet Lahorani
> > <Sumeet.Lahorani at oracle.com> wrote:
> >   
> >> Hi,
> >>
> >> We sometimes see our IB interfaces go down and come back up within 2 or 3
> >> seconds for apparently no reason.
> >>     
> >
> > That can occur without cable pulling, etc. when certain errors are
> > present on the link.
> >
> >   
> >> Dec 17 14:47:23 dscbax14s kernel: ib0: multicast join failed for
> >> ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
> >>     
> >
> > -11 is EAGAIN
> >
> >   
> >> Dec 17 14:47:23 dscbax14s kernel: ib1: multicast join failed for
> >> ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
> >> Dec 17 14:47:23 dscbax14s kernel: bonding: bond0: link status down for idle
> >>  interface ib0, disabling it in 5000 ms.
> >> Dec 17 14:47:23 dscbax14s kernel: bonding: bond0: link status down for idle
> >>  interface ib1, disabling it in 5000 ms.
> >> Dec 17 14:47:25 dscbax14s kernel: bonding: bond0: link status up again after
> >> 2000 ms for interface ib0.
> >> Dec 17 14:47:25 dscbax14s kernel: bonding: bond0: link status up again after
> >> 2000 ms for interface ib1.
> >>
> >> To mask these we've set downdelay & updelay to 5000. But can anybody tell me
> >> why these interfaces could be bouncing down & up like this? We are not
> >> pulling any cables, resetting ports or resetting switches when this happens.
> >> We are using Voltaire ISR9024  switches & Mellanox Technologies MT25418
> >> [ConnectX IB DDR] HCAs.
> >>     
> >
> > Which SM flavor ?
> >
> > Would you dump out the port counters and see how they change
> > before and after one of these "events"?
> >
> > -- Hal
> >
> >   
> >> - Sumeet
> >>
> >> _______________________________________________
> >> general mailing list
> >> general at lists.openfabrics.org
> >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >>
> >> To unsubscribe, please visit
> >> http://openib.org/mailman/listinfo/openib-general
> >>
> >>     


