[ofa-general] Re: Kernel panic in IPoIB stability testing

Wed Feb 4 07:45:28 PST 2009

> I doubt it.  You are leaving a deleted path record as part of the path list.
> This is list corruption (since the list pointers themselves are part of the
> path record structure -- what if this returned storage is re-allocated?).
> 
You are right. 
> I think the correct fix (after your previous posted comment) is:
>         path = __path_find(dev, phdr->hwaddr + 4);
>         if (!path || !path->valid) {
>                 int had_path = 0;
>                 if (!path)
>                         path = path_rec_create(dev, phdr->hwaddr + 4);
>                 else
>                         had_path = 1;
>                 if (path) {
>                         /* put pseudoheader back on for next time */
>                         skb_push(skb, sizeof *phdr);
>                         __skb_queue_tail(&path->queue, skb);
> 
>                         if (path_rec_start(dev, path)) {
>                                 if (had_path) {
>                                         list_del(&path->list);
>                                         rb_erase(&path->rb_node,
>                                                  &priv->path_tree);
>                                 }
>                                 spin_unlock_irqrestore(&priv->lock, flags);
>                                 path_free(dev, path);
>                                 return;
>                         } else
>                                 __path_add(dev, path);
>                 } else {
>                         ++dev->stats.tx_dropped;
>                         dev_kfree_skb_any(skb);
>                 }
> 
>                 spin_unlock_irqrestore(&priv->lock, flags);
>                 return;
>         }
> 
> My only question here is:
> Do we have to worry about netif_tx_lock_bh(dev) (as taken in ipoib_flush_paths)?
> (If we do, we have a problem).
> 
Besides the locking issue that I hadn't think about yet what if we this fix looks the right thing to do.
But what if we leave the path without freeing it even if path_rec_start() fails?
This would leave a path which is not valid in path_list which is not forbidden state as 
I conclude (after all this is the state the function was called)
In this way, I think that we don't  have to worry about locks.

and the code will look like this

        if (!path || !path->valid) {
                if (!path)
                        path = path_rec_create(dev, phdr->hwaddr + 4);
                if (path) {
                        /* put pseudoheader back on for next time */
                        skb_push(skb, sizeof *phdr);
                        __skb_queue_tail(&path->queue, skb);

                        if (!path->query && path_rec_start(dev, path)) {
                                spin_unlock_irqrestore(&priv->lock, flags);
-                               path_free(dev, path);
                                return;
                        } else
                                __path_add(dev, path);
                } else {
                        ++priv->stats.tx_dropped;
                        dev_kfree_skb_any(skb);
                }

                spin_unlock_irqrestore(&priv->lock, flags);
                return;
        }