IPoIB - "TX ring full" (was: Re: [ofa-general] Re: [PATCH v2] IB/ipoib: Split CQs for IPOIB UD)
akepner at sgi.com
Tue Apr 29 15:16:22 PDT 2008
On Tue, Apr 29, 2008 at 02:49:37PM -0700, Roland Dreier wrote:
> By the way, this isn't just theoretical -- I'm not smart enough to
> realize this except that I just saw:
>
> ib1: TX ring full, stopping kernel net queue
> NETDEV WATCHDOG: ib1: transmit timed out
> ib1: transmit timeout: latency 1240 msecs
> ib1: queue stopped 1, tx_head 5291313, tx_tail 5291255
>
It's very interesting to me that you mention this. I'm in the
midst of debugging a similar problem, but with IPoIB circa
OFED 1.2.
I've found two problems:
1) In connected mode it's possible for one (or more) IPoIB-CM send
queues to fill up (for some reason no completions ever arrive
for them) while all the other CM send queues are empty. The
empty TX queues don't generate completions either, so nothing
ever restarts the xmit queue, and one bad connection kills
IPoIB. We have had IPoIB stuck "forever" in this situation. A
simple, brutal fix is to call ipoib_flush_paths() from
ipoib_timeout().
2) We also see situations very similar to what you describe above.
The IPoIB-UD send queue fills and never restarts. (Of course
it has nothing to do with the patch being discussed in this
thread; this is with OFED 1.2-rc2 and also OFED 1.2.)
I don't see how case (2) is possible with circa OFED 1.2 code. Can
anyone clue me in?
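For reference, the counters in the log quoted at the top can be turned
into a count of sends in flight. The snippet below is a small sketch that
assumes tx_head and tx_tail are free-running unsigned 32-bit counters
(my assumption here, not something the log itself states), so the
subtraction is masked to stay correct across wraparound.

```python
MASK32 = 0xFFFFFFFF  # assumption: counters are free-running unsigned 32-bit values


def outstanding(tx_head, tx_tail):
    """Sends posted but not yet completed, correct across counter wraparound."""
    return (tx_head - tx_tail) & MASK32


in_flight = outstanding(5291313, 5291255)  # values from the quoted log
wrapped = outstanding(5, 0xFFFFFFFB)       # head has wrapped past zero, tail has not
```

With the quoted values this gives 58 sends still outstanding at the time
of the timeout message.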
--
Arthur