[ofa-general] IB interfaces occasionally go down & come up for no reason

Hal Rosenstock hal.rosenstock at gmail.com
Thu Dec 18 08:37:45 PST 2008


Hi,

On Thu, Dec 18, 2008 at 3:28 AM, Sumeet Lahorani
<Sumeet.Lahorani at oracle.com> wrote:
>
> Hi,
>
> We sometimes see our IB interfaces go down and come back up within 2 or 3
> seconds for apparently no reason.

That can happen without anyone pulling cables, etc., when certain
errors are present on the link.

> Dec 17 14:47:23 dscbax14s kernel: ib0: multicast join failed for
> ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11

Status -11 is -EAGAIN.
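
If it helps when scanning a lot of these messages, here is a rough
sketch of decoding the status field. The errno table is just a small
Linux subset I typed in by hand, and decode_status is a made-up helper
name, not part of any IB tool.

```python
import re

# Small subset of Linux errno values; kernel messages report them negated.
# Hand-written for illustration only, extend as needed.
LINUX_ERRNO = {
    4: "EINTR",
    11: "EAGAIN",
    12: "ENOMEM",
    22: "EINVAL",
    110: "ETIMEDOUT",
}

STATUS_RE = re.compile(r"status (-?\d+)")


def decode_status(line):
    """Return the errno name for the 'status N' field of a log line, or None."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    code = abs(int(match.group(1)))
    # Fall back to the raw number for codes missing from the table.
    return LINUX_ERRNO.get(code, "errno %d" % code)
```

Feed it the full (unwrapped) syslog line; lines without a status field
come back as None.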

> Dec 17 14:47:23 dscbax14s kernel: ib1: multicast join failed for
> ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11
> Dec 17 14:47:23 dscbax14s kernel: bonding: bond0: link status down for idle
>  interface ib0, disabling it in 5000 ms.
> Dec 17 14:47:23 dscbax14s kernel: bonding: bond0: link status down for idle
>  interface ib1, disabling it in 5000 ms.
> Dec 17 14:47:25 dscbax14s kernel: bonding: bond0: link status up again after
> 2000 ms for interface ib0.
> Dec 17 14:47:25 dscbax14s kernel: bonding: bond0: link status up again after
> 2000 ms for interface ib1.
>
> To mask these we've set downdelay & updelay to 5000. But can anybody tell me
> why these interfaces could be bouncing down & up like this? We are not
> pulling any cables, resetting ports or resetting switches when this happens.
> We are using Voltaire ISR9024  switches & Mellanox Technologies MT25418
> [ConnectX IB DDR] HCAs.

Which SM (subnet manager) flavor are you running?

Would you dump out the port counters and see how they change
before and after one of these "events"?
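
A minimal sketch of the comparison, assuming you saved two text dumps
(for example from perfquery) where each counter is printed as a name,
a colon, a run of dots, and a decimal value. That line format is my
assumption about your tool's output, and parse_counters and
changed_counters are hypothetical helper names.

```python
import re

# Matches lines such as "LinkDowned:......................3".
# Header lines and non-decimal values (e.g. hex fields) are skipped.
COUNTER_RE = re.compile(r"^(\w+):\.*(\d+)\s*$")


def parse_counters(text):
    """Parse a counter dump into a {name: value} dict."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_RE.match(line.strip())
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def changed_counters(before_text, after_text):
    """Return {name: delta} for counters present in both dumps that changed."""
    before = parse_counters(before_text)
    after = parse_counters(after_text)
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }
```

Take one dump while things are quiet and another right after a bounce,
then pass both strings to changed_counters. Error and link-down
counters that moved are the interesting ones.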

-- Hal

> - Sumeet
