IPoIB - "TX ring full" (was: Re: [ofa-general] Re: [PATCH v2] IB/ipoib: Split CQs for IPOIB UD)

Tue Apr 29 15:16:22 PDT 2008

On Tue, Apr 29, 2008 at 02:49:37PM -0700, Roland Dreier wrote:
> By the way, this isn't just theoretical -- I'm not smart enough to
> realize this except that I just saw:
> 
>     ib1: TX ring full, stopping kernel net queue
>     NETDEV WATCHDOG: ib1: transmit timed out
>     ib1: transmit timeout: latency 1240 msecs
>     ib1: queue stopped 1, tx_head 5291313, tx_tail 5291255
> 

It's very interesting to me that you mention this. I'm in the 
midst of debugging a similar problem, but with IPoIB circa 
OFED 1.2.

I have found two problems:

1) In connected mode it's possible to get into a situation where 
   one (or more) IPoIB-CM send queues fill up (no completions 
   ever happen for them for some reason), while all the other 
   CM send queues are empty. Of course the empty TX queues don't 
   generate completions either, so nothing ever restarts the 
   xmit queue and one bad connection kills IPoIB. We have had 
   IPoIB stuck "forever" in this situation. A simple, brutal fix 
   is to call ipoib_flush_paths() from ipoib_timeout().

2) We also see situations very similar to what you describe above. 
   The IPoIB-UD send queue fills and never restarts. (Of course 
   this has nothing to do with the patch being discussed in this 
   thread; we see it with OFED 1.2-rc2 and also with OFED 1.2.)
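
To make the stall in (1) concrete, here is a rough user-space toy 
model, not the driver code. Everything in it (struct toy_dev, 
toy_xmit(), toy_poll_completions(), toy_timeout(), the sizes) is 
made up for illustration. It assumes that only a processed 
completion restarts the net queue, and that flushing paths from 
the timeout handler drains the stuck connection so the queue can 
be restarted.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_CONNS  4   /* hypothetical number of CM connections */
#define SENDQ_SIZE 8   /* hypothetical per-connection send queue depth */

struct toy_conn {
    unsigned tx_head;  /* next slot to post */
    unsigned tx_tail;  /* next slot to complete */
    bool stuck;        /* true: completions never arrive */
};

struct toy_dev {
    struct toy_conn conn[NUM_CONNS];
    bool queue_stopped;  /* models the stopped kernel net queue */
};

/* Post one send on connection c. When that connection's ring becomes
 * full, stop the single net queue shared by all connections. */
static bool toy_xmit(struct toy_dev *dev, int c)
{
    assert(c >= 0 && c < NUM_CONNS);
    struct toy_conn *tx = &dev->conn[c];

    if (dev->queue_stopped)
        return false;
    tx->tx_head++;
    if (tx->tx_head - tx->tx_tail == SENDQ_SIZE)
        dev->queue_stopped = true;
    return true;
}

/* Process completions. In this model the queue is restarted only from
 * here, so an empty ring or a stuck ring never restarts it. */
static void toy_poll_completions(struct toy_dev *dev)
{
    for (int c = 0; c < NUM_CONNS; c++) {
        struct toy_conn *tx = &dev->conn[c];

        if (tx->stuck)
            continue;
        while (tx->tx_tail != tx->tx_head) {
            tx->tx_tail++;
            dev->queue_stopped = false;
        }
    }
}

/* Models the workaround: on transmit timeout, tear down every
 * connection (the assumed effect of flushing paths) and restart
 * the queue. */
static void toy_timeout(struct toy_dev *dev)
{
    for (int c = 0; c < NUM_CONNS; c++) {
        dev->conn[c].tx_tail = dev->conn[c].tx_head;
        dev->conn[c].stuck = false;
    }
    dev->queue_stopped = false;
}

/* Returns true if one stuck connection stalls everything and the
 * flush-on-timeout workaround gets traffic moving again. */
static bool toy_demo(void)
{
    struct toy_dev dev = {0};
    dev.conn[0].stuck = true;

    while (toy_xmit(&dev, 0))
        ;                                  /* fill connection 0 */
    toy_poll_completions(&dev);            /* nothing to complete */
    bool stalled = dev.queue_stopped;
    bool others_blocked = !toy_xmit(&dev, 1);

    toy_timeout(&dev);
    bool recovered = toy_xmit(&dev, 1);
    return stalled && others_blocked && recovered;
}
```

In the model, connection 1 has an empty ring and is perfectly 
healthy, yet it cannot send until toy_timeout() runs, which is 
the "one bad connection kills IPoIB" behaviour described in (1). 
The model says nothing about why completions go missing, and it 
does not cover case (2).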

I don't see how case (2) is possible with circa OFED 1.2 code. Can 
anyone clue me in? 

-- 
Arthur