[ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in ipoib_neigh_cleanup()
akepner at sgi.com
akepner at sgi.com
Thu May 21 14:00:49 PDT 2009
As a recap to this thrilling thread, I'm tracking down a panic with
a backtrace like this:
ib_ipoib:ipoib_neigh_cleanup+368
....
neigh_periodic_timer+0
run_timer_softirq+348
__do_softirq+85
call_softirq+30
do_softirq+44
.....
And the following helpful hint:
Unable to handle kernel paging request at 0000000000100108
^^^^^^^^^^^^^^^^
LIST_POISON1+0x8
So, we're in ipoib_neigh_cleanup(), doing the list_del():
static void ipoib_neigh_cleanup(struct neighbour *n)
{
.......
neigh = *to_ipoib_neigh(n);
.....
spin_lock_irqsave(&priv->lock, flags);
if (neigh->ah)
ah = neigh->ah;
list_del(&neigh->list);
ipoib_neigh_free(n->dev, neigh);
spin_unlock_irqrestore(&priv->lock, flags);
This has been practically impossible to reproduce (and I don't
even have the original crashdump available any longer).
What would prevent a race between a tx completion (with an
error) and the cleanup of a neighbour? In that case the tx
completion handler could do:
ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
{
........
if (wc->status != IB_WC_SUCCESS &&
wc->status != IB_WC_WR_FLUSH_ERR) {
.....
spin_lock_irqsave(&priv->lock, flags);
neigh = tx->neigh;
if (neigh) {
neigh->cm = NULL;
list_del(&neigh->list);
.......
spin_unlock_irqrestore(&priv->lock, flags);
While ipoib_neigh_cleanup() could grab the (now stale) neigh, and
crash like above.
(I've tried simulating tx completion failures to trigger this
behavior, but haven't gotten lucky yet....)
--
Arthur
More information about the general
mailing list