[ofa-general] Re: [ewg] OFED April 21 meeting summary

Mon Apr 28 09:37:31 PDT 2008

On Mon, Apr 28, 2008 at 07:14:39PM +0300, Olga Shern (Voltaire) wrote:
> ...
>   https://bugs.openfabrics.org/show_bug.cgi?id=985 we will try to reproduce
> it on upstream kernel and let you know

I just saw this bug report today, but we've had similar crashes.
It looks like the problem is that ipoib_neigh_cleanup() does this
without holding any lock:

    neigh = *to_ipoib_neigh(n);

then later:

    spin_lock_irqsave(&priv->lock, flags);
    if (neigh->ah)
            ah = neigh->ah;
    list_del(&neigh->list);    <---- neigh may be stale now
    ipoib_neigh_free(n->dev, neigh);
    spin_unlock_irqrestore(&priv->lock, flags);

neigh isn't re-read after the lock is acquired, so by then it may
point to a data structure that has already been freed.
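
To illustrate the pattern (this is not a tested patch for ipoib), here
is a minimal userspace sketch in which the pointer is read only after
the lock is taken. The names neigh_model, slot, priv_lock and
cleanup_model are hypothetical stand-ins, a pthread mutex stands in
for priv->lock, and the sketch assumes that whichever path frees the
entry also clears the slot under that same lock:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct ipoib_neigh. */
struct neigh_model {
    int ah;
};

/* Hypothetical stand-in for the slot that to_ipoib_neigh(n) points at. */
struct slot {
    struct neigh_model *neigh;
};

/* Stands in for priv->lock; a plain mutex is enough for the model. */
static pthread_mutex_t priv_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Free the entry referenced by the slot, if there still is one.
 * The pointer is read only after the lock is held, so a caller that
 * loses the race sees NULL instead of a stale pointer.
 * Returns 1 if this call freed the entry, 0 if it was already gone.
 */
static int cleanup_model(struct slot *s)
{
    struct neigh_model *neigh;
    int freed = 0;

    assert(s != NULL);

    pthread_mutex_lock(&priv_lock);
    neigh = s->neigh;          /* (re-)read under the lock */
    if (neigh != NULL) {
        s->neigh = NULL;       /* assumption: every freeing path clears the slot */
        free(neigh);
        freed = 1;
    }
    pthread_mutex_unlock(&priv_lock);

    return freed;
}
```

Calling cleanup_model() twice on the same slot frees the entry once;
the second call finds the slot empty and does nothing.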

In our crashes we had backtraces like:

RIP: ib_ipoib:ipoib_neigh_cleanup+368
     neigh_destroy+197
     neigh_periodic_timer+249
     neigh_periodic_timer+0
     run_timer_softirq+348
     __do_softirq+85
     call_softirq+30
     do_softirq+44
     .....

And the following helpful hint:

Unable to handle kernel paging request at 0000000000100108
                                          ^^^^^^^^^^^^^^^^
                                          LIST_POISON1 + 0x8

So we were dying in the midst of list_del().
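
For anyone who wants to check the arithmetic, here is a small
userspace model of list_del() poisoning. It assumes the poison
values from include/linux/poison.h in kernels of this vintage
(0x00100100 and 0x00200200) and 8-byte pointers; list_del_model and
second_del_fault_addr are hypothetical names, and nothing here
dereferences the poison pointer:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace copy of the kernel's list node layout. */
struct list_head {
    struct list_head *next, *prev;
};

/* Assumed poison values, as in include/linux/poison.h at the time. */
#define LIST_POISON1 ((struct list_head *)0x00100100)
#define LIST_POISON2 ((struct list_head *)0x00200200)

/* Model of list_del(): unlink the entry, then poison its pointers. */
static void list_del_model(struct list_head *entry)
{
    struct list_head *prev = entry->prev;
    struct list_head *next = entry->next;

    next->prev = prev;
    prev->next = next;
    entry->next = LIST_POISON1;
    entry->prev = LIST_POISON2;
}

/*
 * Address that a second list_del() on the same entry would write to
 * first: the store "next->prev = prev" with next == LIST_POISON1.
 * Only computes the address; it is never dereferenced.
 */
static uintptr_t second_del_fault_addr(const struct list_head *entry)
{
    assert(entry->next == LIST_POISON1);   /* entry must already be deleted */
    return (uintptr_t)entry->next + offsetof(struct list_head, prev);
}
```

With 8-byte pointers, prev sits at offset 8, so the computed address
is LIST_POISON1 + 0x8, which matches the faulting address above.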

-- 
Arthur