[ofa-general] IPoIB-UD TX timeouts (OFED 1.2)

akepner at sgi.com
Wed Apr 30 12:23:54 PDT 2008


At a customer site running OFED 1.2 we are seeing the
following: after tens of hours of stressing IPoIB,
the card apparently stops generating TX completions.
(These are MT25204 cards in x86_64 boxes, and we've seen
this with a couple of firmware versions, including the latest.)

We get something like:

kernel: NETDEV WATCHDOG: ib0: transmit timed out
kernel: ib0: transmit timeout: latency 1972 msecs
kernel: ib0: queue stopped 1, tx_head 3271, tx_tail 3207

and that repeats "forever".
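
For reference, the gap between tx_head and tx_tail in the last
line can be pulled out with a small script. This is only a sketch,
and it assumes that tx_head and tx_tail are free-running unsigned
32-bit counters of posted and completed sends (my reading of the
driver, not checked against the OFED 1.2 source). The function
name and the regex are illustrative, not part of any tool.

```python
import re

# The "queue stopped" line exactly as it appears in the log above.
LOG_LINE = "kernel: ib0: queue stopped 1, tx_head 3271, tx_tail 3207"

# Illustrative pattern for the IPoIB transmit-timeout status line.
PATTERN = re.compile(
    r"queue stopped (?P<stopped>\d+), "
    r"tx_head (?P<head>\d+), tx_tail (?P<tail>\d+)"
)


def outstanding_sends(line, counter_bits=32):
    """Return tx_head - tx_tail for a status line, or None if no match.

    counter_bits=32 is an assumption: the subtraction is done modulo
    2**32 so the result stays correct if the counters have wrapped.
    """
    match = PATTERN.search(line)
    if match is None:
        return None
    head = int(match.group("head"))
    tail = int(match.group("tail"))
    return (head - tail) % (1 << counter_bits)


print(outstanding_sends(LOG_LINE))  # 3271 - 3207 = 64
```

Here the gap is 64 sends posted but not completed. If that equals
the configured send queue size on these hosts (worth checking, I
have not confirmed it), the ring is completely full, which would
point at completions not arriving at all rather than arriving late.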

To simplify things, we can reproduce this behavior in
datagram mode.

As long as only datagram mode is in use, the TX code in the
IPoIB driver seems quite straightforward. The only reason I
can imagine that we'd fail to get a timely TX completion
would be if link-level flow control were to throttle us. And
I'd expect that to be a transient condition... Am I
overlooking something? Anyone seen similar? Suggestions for
debugging?

-- 
Arthur



