[ewg] OFED-1.3.2-20080728-0355.tgz issues

Eli Cohen eli at dev.mellanox.co.il
Mon Sep 8 03:21:46 PDT 2008


On Thu, Sep 04, 2008 at 10:38:34AM -0400, Doug Ledford wrote:
 
> First, in kernel_patches/backports/2.6.18-EL5.1 there are two patches
> that go together, they are the ipoib-0400... and ipoib-0410...  Together
> they switch ipoib to use dev_set_mtu and rtnl_lock/rtnl_unlock.
> Unfortunately, these two patches are terminal.  Evidently the author
> never tried them in UD mode, only in RD mode.  I say this because there
> is a conditional operation in ipoib_mcast_join_task that is only
> executed if you are in UD mode.  That code (combined with the other
> patch), in UD mode, calls rtnl_lock(); dev_set_mtu(); rtnl_unlock().
> Well, that's all find and dandy any time ipoib_mcast_join_task() is
> called from kernel thread context where we aren't already holding the
> rtnl_lock().  But, it's highly broken when ipoib_mcast_join_task() is
> run as part of flush_workqueue() that's called from user context with
> the rtnl_lock() already held.  In any case, it's completely DOA in UD
> mode, just try running ifdown ib0, or rebooting the machine if your
> openibd init script wants to down ib0.
I think this problem is already addressed in this patch posted by
Roland so I think we can use it for ofed too.
a77a57a1a22afc31891d95879fe3cf2ab03838b0


> 
> Next, I was watching the compiler warnings as things compiled, and I got
> an array subscript out of bounds warning.  Turns out it was a legitimate
> error.  There is a thinko in ipoib_transport_dev_init (well, at least in
> the file I had here, I assume the final version of the files you have
> would end up the same or close to it once quilt has done what I did
> manually).  The snippet to fix the bug looks like this:
> 
> @@ -228,8 +233,9 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
>                 priv->rx_wr_draft[i].sg_list = &priv->sglist_draft[i][0];
>                 if (i < UD_POST_RCV_COUNT - 1)
>                         priv->rx_wr_draft[i].next = &priv->rx_wr_draft[i + 1];
> +               else
> +                       priv->rx_wr_draft[i].next = NULL;
>         }
> -       priv->rx_wr_draft[i].next = NULL;
> 
>         if (ipoib_ud_need_sg(priv->max_ib_mtu)) {
>                 for (i = 0; i < UD_POST_RCV_COUNT; ++i) {
> 

Yes, that's a bug, and I think the reason it did not occur as a real
problem is because the whole ipoib_dev_priv struct is cleared at
allocation so the last entry was NULL due to that. Anyway, I think I
prefer this fix to the same problem:

Index: ofed_kernel-fixes/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
===================================================================
--- ofed_kernel-fixes.orig/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2008-09-08 13:07:02.000000000 +0300
+++ ofed_kernel-fixes/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2008-09-08 13:08:41.000000000 +0300
@@ -234,7 +234,7 @@ int ipoib_transport_dev_init(struct net_
 		if (i < UD_POST_RCV_COUNT - 1)
 			priv->rx_wr_draft[i].next = &priv->rx_wr_draft[i + 1];
 	}
-	priv->rx_wr_draft[i].next = NULL;
+	priv->rx_wr_draft[UD_POST_RCV_COUNT - 1].next = NULL;
 
 	if (ipoib_ud_need_sg(priv->max_ib_mtu)) {
 		for (i = 0; i < UD_POST_RCV_COUNT; ++i) {

What do you think?



More information about the ewg mailing list