[ofa-general] IB interfaces occasionally go down & come up for no reason

Hal Rosenstock hal.rosenstock at gmail.com
Thu Dec 18 10:44:05 PST 2008


On Thu, Dec 18, 2008 at 11:54 AM, Sumeet Lahorani
<Sumeet.Lahorani at oracle.com> wrote:
>
> We are using the SM on the voltaire switch.

As Ira indicated, Voltaire also has a performance manager colocated with the SM.

> I could collect before & after snapshots of the port counters if I had a way
> of knowing when the event was about to happen. The problem is I don't.

How often does the link bounce? Do you see the LEDs on that port
change? How did you determine that the link bounces periodically?

> I guess we could run ibqueryerrors.pl

Or just run perfquery against the CA port that is thought to be bouncing.

> every 5 seconds or so and correlate this event based on the timestamp.

As Ira indicated, this may not be fruitful if the Voltaire performance
manager is resetting the error counters, but it can't hurt to see
whether any interesting counters change.
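
A rough sketch of that polling idea in Python, in case it helps. It
assumes perfquery prints one "Name:....value" line per counter and
accepts a LID and port number as positional arguments; check both
against your installed version. The function names are made up for
illustration. Each sample gets a syslog-style timestamp so it can be
lined up with the bonding messages.

```python
import re
import subprocess
import time

# Assumed perfquery layout: "SymbolErrors:....................0".
# Header lines and hex fields such as CounterSelect do not match and are skipped.
COUNTER_LINE = re.compile(r"^(\w+):\.*(\d+)\s*$")


def parse_perfquery(text):
    """Return {counter_name: int} for every 'Name:....value' line."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line.strip())
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def format_sample(stamp, counters):
    """Render one sample as 'stamp Name=value ...' with names sorted."""
    fields = " ".join(f"{name}={counters[name]}" for name in sorted(counters))
    return f"{stamp} {fields}"


def poll(lid, port, interval=5.0, samples=None):
    """Print a timestamped sample every `interval` seconds.

    Runs until interrupted when `samples` is None. The perfquery argument
    order (lid, then port) is an assumption about the installed tool.
    """
    taken = 0
    while samples is None or taken < samples:
        result = subprocess.run(
            ["perfquery", str(lid), str(port)],
            capture_output=True, text=True, check=True,
        )
        stamp = time.strftime("%b %d %H:%M:%S")
        print(format_sample(stamp, parse_perfquery(result.stdout)), flush=True)
        taken += 1
        time.sleep(interval)
```

Something like poll(12, 1) redirected to a file would then give one line
per sample to compare with the kernel log around the time of a bounce.
The LID and port here are placeholders.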

> Is there some tracing I could turn on to dump out the reason for the link
> bounce?

That may not be fruitful, depending on the nature of the problem. The
error counters are the first-level diagnostic for deciding where to
look next. Also, the level of tracing available will depend on what
external tools you have.

> Do you have some examples of the errors that can lead to such a link bounce?

See IBA 1.2.1 vol 2, p. 157, 5.7 LINK PHYSICAL ERROR HANDLING

LinkErrorRecoveryCounter and LinkDownedCounter count the interesting
events at the physical level. One specific example that Ira pointed
out is a rate of SymbolErrors (a minor event) that exceeds the
threshold. That section discusses a number of others.
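
To make the before/after comparison concrete, here is a small Python
sketch that takes two counter snapshots as plain dicts and reports how
much the link-level counters grew. The counter spellings differ between
perfquery versions, so both forms are listed as an assumption, and the
function name is hypothetical. A counter that goes backwards is reported
as None, since it was probably reset between samples (for example by the
performance manager) and the real increase is unknown.

```python
# Link-level counters of interest. Both the short and the long spellings
# are listed because the names printed vary by tool version (assumption).
LINK_COUNTERS = (
    "SymbolErrors", "SymbolErrorCounter",
    "LinkRecovers", "LinkErrorRecoveryCounter",
    "LinkDowned", "LinkDownedCounter",
)


def link_deltas(before, after):
    """Return {name: increase} for link-level counters that changed.

    Counters missing from either snapshot, and counters that did not
    change, are left out. A counter that decreased maps to None because
    a reset between samples hides the true increase.
    """
    deltas = {}
    for name in LINK_COUNTERS:
        if name not in before or name not in after:
            continue
        diff = after[name] - before[name]
        if diff > 0:
            deltas[name] = diff
        elif diff < 0:
            deltas[name] = None
    return deltas
```

A nonzero LinkDowned or LinkRecovers delta in the interval that covers a
bonding "link status down" message would point at the physical link
rather than at the SM or the multicast join.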

-- Hal

> - Sumeet
>
> Hal Rosenstock wrote:
>>
>> Hi,
>>
>> On Thu, Dec 18, 2008 at 3:28 AM, Sumeet Lahorani
>> <Sumeet.Lahorani at oracle.com> wrote:
>>
>>>
>>> Hi,
>>>
>>> We sometimes see our IB interfaces go down and come back up within 2 or 3
>>> seconds for apparently no reason.
>>>
>>
>> That can occur without cable pulling, etc. when certain errors are
>> present on the link.
>>
>>
>>>
>>> Dec 17 14:47:23 dscbax14s kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
>>>
>>
>> -11 is EAGAIN
>>
>>
>>>
>>> Dec 17 14:47:23 dscbax14s kernel: ib1: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
>>> Dec 17 14:47:23 dscbax14s kernel: bonding: bond0: link status down for idle interface ib0, disabling it in 5000 ms.
>>> Dec 17 14:47:23 dscbax14s kernel: bonding: bond0: link status down for idle interface ib1, disabling it in 5000 ms.
>>> Dec 17 14:47:25 dscbax14s kernel: bonding: bond0: link status up again after 2000 ms for interface ib0.
>>> Dec 17 14:47:25 dscbax14s kernel: bonding: bond0: link status up again after 2000 ms for interface ib1.
>>>
>>> To mask these we've set downdelay & updelay to 5000. But can anybody tell me
>>> why these interfaces could be bouncing down & up like this? We are not
>>> pulling any cables, resetting ports or resetting switches when this happens.
>>> We are using Voltaire ISR9024 switches & Mellanox Technologies MT25418
>>> [ConnectX IB DDR] HCAs.
>>>
>>
>> Which SM flavor ?
>>
>> Would you dump out the port counters and see how they change
>> before and after one of these "events"?
>>
>> -- Hal
>>
>>
>>>
>>> - Sumeet
>>>
>>> _______________________________________________
>>> general mailing list
>>> general at lists.openfabrics.org
>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>>
>>> To unsubscribe, please visit
>>> http://openib.org/mailman/listinfo/openib-general
>>>
>>>
>>
>
>


