[ofa-general] [PATCH] ipoib: null tx/rx_ring skb pointers on free

akepner at sgi.com akepner at sgi.com
Thu Nov 6 08:40:05 PST 2008


On Thu, Nov 06, 2008 at 05:12:50PM +0200, Jack Morgenstein wrote:
> On Thursday 06 November 2008 03:23, akepner at sgi.com wrote:
> > I described an IPoIB-related panic we were seeing on large 
> > clusters. The signature was a backtrace like this:
> > 
> >         skb_over_panic
> >         :ib_ipoib:ipoib_ib_handle_rx_wc
> >         :ib_ipoib:ipoib_poll
> >         net_rx_action
> >         .....
> > 
> > The bug is difficult to reproduce, but we finally got a crashdump, 
> > and the problem appears to be that stale skb pointers on the tx_ring 
> > were left pointing to skbs that had been since reused, so that the 
> > skb's data region was now unexpectedly short, etc. 
> > 
> How does ipoib_ib_handle_rx_wc() involve the tx_ring? This is 
> receive processing.
> 

What I surmise may be happening is something like this:

- tx skb is freed, but a stale pointer remains on tx_ring
- the same skb is reallocated, and added to the rx_ring
- now we get an 'unexpected' tx completion, and use the stale 
  skb pointer on the tx_ring to free the skb a second time (this 
  step seems to involve a firmware bug)
- another driver, say an ethernet driver, reallocates the skb, 
  reducing the extent of the data region (leading to the 
  skb_over_panic once it's processed by ipoib_ib_handle_rx_wc)
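
To make the sequence above concrete, here is a minimal userspace 
sketch of the idea in the subject line: clear the ring slot when the 
buffer is freed, so that a second, unexpected completion for the same 
slot finds NULL instead of a stale pointer. This is not the patch 
itself. struct fake_skb, struct fake_ring, ring_post(), 
ring_complete() and RING_SIZE are hypothetical names invented for 
illustration, and malloc/free stand in for the kernel's skb allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define RING_SIZE 4            /* hypothetical size, for illustration only */

/* Hypothetical stand-in for struct sk_buff. */
struct fake_skb {
    size_t len;
};

/* One buffer pointer per slot, loosely modelled on a tx ring. */
struct fake_ring {
    struct fake_skb *skb[RING_SIZE];
};

/* Allocate a buffer and post it in slot idx.
 * Returns 0 on success, -1 if the slot is occupied or allocation fails. */
static int ring_post(struct fake_ring *ring, size_t idx, size_t len)
{
    struct fake_skb *skb;

    assert(idx < RING_SIZE);
    if (ring->skb[idx] != NULL)
        return -1;

    skb = malloc(sizeof(*skb));
    if (skb == NULL)
        return -1;

    skb->len = len;
    ring->skb[idx] = skb;
    return 0;
}

/* Handle a completion for slot idx.
 * Returns 1 if a buffer was freed, 0 if the slot was already empty,
 * i.e. the completion was unexpected and is ignored. */
static int ring_complete(struct fake_ring *ring, size_t idx)
{
    struct fake_skb *skb;

    assert(idx < RING_SIZE);
    skb = ring->skb[idx];
    if (skb == NULL)
        return 0;

    ring->skb[idx] = NULL;     /* clear the slot so no stale pointer remains */
    free(skb);
    return 1;
}
```

With the slot cleared, a duplicate completion becomes a detectable 
no-op rather than a second free of a buffer that may already belong 
to someone else.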


This bug leaves the tx and rx rings corrupted in many ways, 
including:

- different rx_ring members refer to the same skb
- different skbs on the rx_ring have identical data, head, end, tail ptrs
- skbs on the rx_ring have sizes inconsistent with what the ipoib 
  driver allocates (which causes the skb_over_panic, of course)
- rx skbs have 'dev' pointers to ethernet devices
- DMA mappings in the rx_ring aren't consistent with what's in the skb
- some skbs are simultaneously on the tx and rx rings
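
Two of the symptoms above (the same skb in several rx_ring slots, and 
skbs present on both rings) are simple pointer-aliasing checks. The 
sketch below shows one way such checks could look when run over 
arrays of slot pointers pulled from a dump. It is an illustration 
only, not the tooling used here; count_rx_duplicates() and 
count_shared() are hypothetical helpers, and the rings are modelled 
as plain arrays of void pointers.

```c
#include <stddef.h>

/* Count non-NULL rx slots whose pointer already appeared in an
 * earlier rx slot (the same buffer referenced more than once). */
static size_t count_rx_duplicates(void *const *rx, size_t n)
{
    size_t dups = 0;

    for (size_t i = 1; i < n; i++) {
        if (rx[i] == NULL)
            continue;
        for (size_t j = 0; j < i; j++) {
            if (rx[j] == rx[i]) {
                dups++;
                break;          /* count each aliased slot once */
            }
        }
    }
    return dups;
}

/* Count non-NULL tx slots whose pointer also appears on the rx ring
 * (a buffer that is on both rings at the same time). */
static size_t count_shared(void *const *tx, size_t ntx,
                           void *const *rx, size_t nrx)
{
    size_t shared = 0;

    for (size_t i = 0; i < ntx; i++) {
        if (tx[i] == NULL)
            continue;
        for (size_t j = 0; j < nrx; j++) {
            if (tx[i] == rx[j]) {
                shared++;
                break;
            }
        }
    }
    return shared;
}
```

On healthy rings both counts should be zero. The quadratic scan is 
chosen for brevity, which is acceptable for rings of a few hundred 
entries.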

-- 
Arthur



