[ofa-general] IPoIB-UD TX timeouts (OFED 1.2)
Eli Cohen
eli at dev.mellanox.co.il
Wed Apr 30 13:00:55 PDT 2008
Arthur,
When it happens, please:
1. Check the link error counters.
2. Disconnect and reconnect the cable and see if it recovers.
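
For step 1, one way to look at the link error counters is to read them from sysfs before and after a stall and see which ones moved. This is only a rough sketch: the device name mthca0 and port 1 are assumptions (adjust them for the machine in question), the helper names are made up for illustration, and perfquery from infiniband-diags reports the same kind of information if it is installed.

```python
from pathlib import Path

# Standard per-port error counter files under
# /sys/class/infiniband/<device>/ports/<port>/counters/
COUNTER_NAMES = (
    "symbol_error",
    "link_error_recovery",
    "link_downed",
    "port_rcv_errors",
    "port_xmit_discards",
    "local_link_integrity_errors",
    "excessive_buffer_overrun_errors",
)


def read_link_counters(device="mthca0", port=1, root="/sys/class/infiniband"):
    """Return {counter name: value}; value is None if the file is unreadable.

    device and port are assumptions, not taken from the report.
    """
    counters_dir = Path(root) / device / "ports" / str(port) / "counters"
    values = {}
    for name in COUNTER_NAMES:
        try:
            values[name] = int((counters_dir / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = None
    return values


def counters_that_moved(before, after):
    """Return {counter name: delta} for counters whose value changed."""
    moved = {}
    for name, old in before.items():
        new = after.get(name)
        if old is not None and new is not None and new != old:
            moved[name] = new - old
    return moved


if __name__ == "__main__":
    # Print one snapshot; take another after the timeout and compare the two.
    for name, value in read_link_counters().items():
        print(name, value)
```

A snapshot taken while the link is healthy and another taken after the watchdog fires can be passed to counters_that_moved() to see whether any error counter advanced in between.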
On 4/30/08, akepner at sgi.com <akepner at sgi.com> wrote:
>
> At a customer site running OFED 1.2 we are seeing the
> following - after ~10s of hours of stressing IPoIB,
> the card apparently stops generating TX completions.
> (These are MT25204 cards in x86_64 boxes, and we've seen
> this with a couple f/w versions, including the latest.)
>
> We get something like:
>
> kernel: NETDEV WATCHDOG: ib0: transmit timed out
> kernel: ib0: transmit timeout: latency 1972 msecs
> kernel: ib0: queue stopped 1, tx_head 3271, tx_tail 3207
>
> and that repeats "forever".
>
> And to simplify things, we can produce this behavior in
> datagram mode.
>
> As long as only datagram mode is in use, the TX code in the
> IPoIB driver seems quite straightforward. The only reason I
> can imagine that we'd fail to get a timely TX completion
> would be if link-level flow control were to throttle us. And
> I'd expect that to be a transient condition... Am I
> overlooking something? Anyone seen similar? Suggestions for
> debugging?
>
> --
> Arthur
>
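
The watchdog output quoted above already gives a rough count of the sends that never completed. The sketch below assumes that tx_head counts posted sends and tx_tail counts completed ones, and that both are unsigned 32-bit counters that can wrap; neither assumption is confirmed for this OFED 1.2 build, and the function name is made up for illustration.

```python
import re

# Matches the driver's timeout line as quoted in the report.
TX_STATE = re.compile(r"queue stopped (\d+), tx_head (\d+), tx_tail (\d+)")


def outstanding_sends(line):
    """Return tx_head - tx_tail for a matching log line, else None.

    Assumes unsigned 32-bit counters, so the difference is taken modulo 2**32.
    """
    match = TX_STATE.search(line)
    if match is None:
        return None
    _stopped, head, tail = (int(group) for group in match.groups())
    return (head - tail) & 0xFFFFFFFF


if __name__ == "__main__":
    line = "kernel: ib0: queue stopped 1, tx_head 3271, tx_tail 3207"
    print(outstanding_sends(line))  # 3271 - 3207 = 64
```

For the quoted line the difference is 64. If the TX ring in this build holds 64 entries (an assumption worth checking against the module's send queue size), the ring is completely full, which would fit the queue being stopped and the timeout repeating.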