[ofa-general] [RFC] ipoib: avoid using stale ipoib_neigh* in ipoib_neigh_cleanup()
akepner at sgi.com
akepner at sgi.com
Fri May 22 11:44:03 PDT 2009
On Fri, May 22, 2009 at 08:34:57PM +0300, Yossi Etigin wrote:
> ...
> Interesting... what does it deadlock with?
> And what is the hole your fix leaves? If the (neigh!=NULL) check passes
> with the spinlock held, shouldn't it be OK to list_del() it?
>
Unfortunately, I don't have enough information to answer that
question any longer (it's an old, closed bug). But the crash dump
showed a hang like this:
PID: 8643 TASK: ffff810130f060c0 CPU: 3 COMMAND: "sshd"
.......
#3 [ffff81013b3d7ea0] .text.lock.spinlock at ffffffff802ea2df (via _spin_lock_i
rqsave)
#4 [ffff81013b3d7ea0] ipoib_neigh_cleanup at ffffffff883f8972
#5 [ffff81013b3d7ed0] neigh_destroy at ffffffff8029011c
crash> dis ipoib_neigh_cleanup
0xffffffff883f8952 <ipoib_neigh_cleanup>: push %r13
0xffffffff883f8954 <ipoib_neigh_cleanup+2>: push %r12
0xffffffff883f8956 <ipoib_neigh_cleanup+4>: push %rbp
0xffffffff883f8957 <ipoib_neigh_cleanup+5>: mov %rdi,%rbp
0xffffffff883f895a <ipoib_neigh_cleanup+8>: push %rbx
0xffffffff883f895b <ipoib_neigh_cleanup+9>: sub $0x8,%rsp
0xffffffff883f895f <ipoib_neigh_cleanup+13>: mov 0x18(%rdi),%rax
0xffffffff883f8963 <ipoib_neigh_cleanup+17>: lea 0x500(%rax),%r12
0xffffffff883f896a <ipoib_neigh_cleanup+24>: mov %r12,%rdi
0xffffffff883f896d <ipoib_neigh_cleanup+27>: callq 0xffffffff802ea1a5 <_spin_lock_irqsave>
0xffffffff883f8972 <ipoib_neigh_cleanup+32>: mov %rax,%rsi
.....
and here is the patch to ipoib_neigh_cleanup() that was in use:
diff -rup c/ofa_kernel-1.3/drivers/infiniband/ulp/ipoib/ipoib_main.c d/ofa_kernel-1.3/drivers/infiniband/ulp/ipoib/ipoib_main.c
--- c/ofa_kernel-1.3/drivers/infiniband/ulp/ipoib/ipoib_main.c 2008-04-22 13:25:23.131563415 -0700
+++ d/ofa_kernel-1.3/drivers/infiniband/ulp/ipoib/ipoib_main.c 2008-04-22 15:24:31.475721847 -0700
@@ -821,11 +821,15 @@ static void ipoib_neigh_cleanup(struct n
unsigned long flags;
struct ipoib_ah *ah = NULL;
+ spin_lock_irqsave(&priv->lock, flags);
neigh = *to_ipoib_neigh(n);
+ spin_unlock_irqrestore(&priv->lock, flags);
if (neigh) {
- priv = netdev_priv(neigh->dev);
- ipoib_dbg(priv, "neigh_destructor for bonding device: %s\n",
- n->dev->name);
+ if (priv != netdev_priv(neigh->dev)) {
+ ipoib_dbg(priv, "neigh_destructor for bonding device: "
+ "%s\n", n->dev->name);
+ priv = netdev_priv(neigh->dev);
+ }
} else
return;
ipoib_dbg(priv,
@@ -835,10 +839,13 @@ static void ipoib_neigh_cleanup(struct n
spin_lock_irqsave(&priv->lock, flags);
- if (neigh->ah)
- ah = neigh->ah;
- list_del(&neigh->list);
- ipoib_neigh_free(n->dev, neigh);
+ neigh = *to_ipoib_neigh(n);
+ if (neigh) {
+ if (neigh->ah)
+ ah = neigh->ah;
+ list_del(&neigh->list);
+ ipoib_neigh_free(n->dev, neigh);
+ }
spin_unlock_irqrestore(&priv->lock, flags);
I'll see if I can reproduce the deadlock (with a new kernel).
--
Arthur
More information about the general
mailing list