[ofa-general] IPoIB-UD TX timeouts (OFED 1.2)
akepner at sgi.com
Wed Apr 30 12:23:54 PDT 2008
At a customer site running OFED 1.2 we are seeing the
following: after tens of hours of stressing IPoIB,
the card apparently stops generating TX completions.
(These are MT25204 cards in x86_64 boxes, and we've seen
this with a couple of firmware versions, including the latest.)
We get something like:
kernel: NETDEV WATCHDOG: ib0: transmit timed out
kernel: ib0: transmit timeout: latency 1972 msecs
kernel: ib0: queue stopped 1, tx_head 3271, tx_tail 3207
and that repeats "forever".
To simplify things, we can reproduce this behavior in
datagram mode.
As long as only datagram mode is in use, the TX code in the
IPoIB driver seems quite straightforward. The only reason I
can imagine that we'd fail to get a timely TX completion
would be if link-level flow control were to throttle us. And
I'd expect that to be a transient condition... Am I
overlooking something? Has anyone seen something similar?
Any suggestions for debugging?
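
One thing that can be read straight from the watchdog output: tx_head
minus tx_tail is the number of sends posted but not yet completed, and
in the log above that is 3271 - 3207 = 64. If that difference stays
pinned at the same value across repeats, no completions are arriving at
all. Below is a minimal sketch for pulling those counters out of syslog
lines. The regular expression is based only on the three log lines
quoted above, the function name is made up for illustration, and the
32-bit wraparound is an assumption about how the driver stores the
counters.

```python
import re

# Matches the "queue stopped" line from the excerpt above. The format is
# assumed from that one example and may differ between driver versions.
TX_STATE = re.compile(
    r"(?P<dev>\w+): queue stopped (?P<stopped>\d+), "
    r"tx_head (?P<head>\d+), tx_tail (?P<tail>\d+)"
)


def outstanding_sends(line):
    """Return (dev, stopped, head, tail, outstanding), or None if no match."""
    m = TX_STATE.search(line)
    if m is None:
        return None
    head = int(m.group("head"))
    tail = int(m.group("tail"))
    # Assumption: the counters are unsigned 32-bit values that wrap, so the
    # difference is taken modulo 2**32.
    outstanding = (head - tail) & 0xFFFFFFFF
    return m.group("dev"), int(m.group("stopped")), head, tail, outstanding


if __name__ == "__main__":
    import sys

    # Usage: feed syslog on stdin and print one summary per matching line.
    for raw in sys.stdin:
        parsed = outstanding_sends(raw)
        if parsed is not None:
            dev, stopped, head, tail, pending = parsed
            print(f"{dev}: stopped={stopped} head={head} tail={tail} "
                  f"outstanding={pending}")
```

Running this over the full log would show whether tx_tail ever advances
between timeouts or whether it is stuck at the same value each time.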
--
Arthur