[ewg] OFED-1.3.2-20080728-0355.tgz issues
Eli Cohen
eli at dev.mellanox.co.il
Mon Sep 8 03:21:46 PDT 2008
On Thu, Sep 04, 2008 at 10:38:34AM -0400, Doug Ledford wrote:
> First, in kernel_patches/backports/2.6.18-EL5.1 there are two patches
> that go together, they are the ipoib-0400... and ipoib-0410... Together
> they switch ipoib to use dev_set_mtu and rtnl_lock/rtnl_unlock.
> Unfortunately, these two patches are terminal. Evidently the author
> never tried them in UD mode, only in RD mode. I say this because there
> is a conditional operation in ipoib_mcast_join_task that is only
> executed if you are in UD mode. That code (combined with the other
> patch), in UD mode, calls rtnl_lock(); dev_set_mtu(); rtnl_unlock().
> Well, that's all find and dandy any time ipoib_mcast_join_task() is
> called from kernel thread context where we aren't already holding the
> rtnl_lock(). But, it's highly broken when ipoib_mcast_join_task() is
> run as part of flush_workqueue() that's called from user context with
> the rtnl_lock() already held. In any case, it's completely DOA in UD
> mode, just try running ifdown ib0, or rebooting the machine if your
> openibd init script wants to down ib0.
I think this problem is already addressed in this patch posted by
Roland so I think we can use it for ofed too.
a77a57a1a22afc31891d95879fe3cf2ab03838b0
>
> Next, I was watching the compiler warnings as things compiled, and I got
> an array subscript out of bounds warning. Turns out it was a legitimate
> error. There is a thinko in ipoib_transport_dev_init (well, at least in
> the file I had here, I assume the final version of the files you have
> would end up the same or close to it once quilt has done what I did
> manually). The snippet to fix the bug looks like this:
>
> @@ -228,8 +233,9 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
> priv->rx_wr_draft[i].sg_list = &priv->sglist_draft[i][0];
> if (i < UD_POST_RCV_COUNT - 1)
> priv->rx_wr_draft[i].next = &priv->rx_wr_draft[i + 1];
> + else
> + priv->rx_wr_draft[i].next = NULL;
> }
> - priv->rx_wr_draft[i].next = NULL;
>
> if (ipoib_ud_need_sg(priv->max_ib_mtu)) {
> for (i = 0; i < UD_POST_RCV_COUNT; ++i) {
>
Yes, that's a bug, and I think the reason it did not occur as a real
problem is because the whole ipoib_dev_priv struct is cleared at
allocation so the last entry was NULL due to that. Anyway, I think I
prefer this fix to the same problem:
Index: ofed_kernel-fixes/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
===================================================================
--- ofed_kernel-fixes.orig/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2008-09-08 13:07:02.000000000 +0300
+++ ofed_kernel-fixes/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2008-09-08 13:08:41.000000000 +0300
@@ -234,7 +234,7 @@ int ipoib_transport_dev_init(struct net_
if (i < UD_POST_RCV_COUNT - 1)
priv->rx_wr_draft[i].next = &priv->rx_wr_draft[i + 1];
}
- priv->rx_wr_draft[i].next = NULL;
+ priv->rx_wr_draft[UD_POST_RCV_COUNT - 1].next = NULL;
if (ipoib_ud_need_sg(priv->max_ib_mtu)) {
for (i = 0; i < UD_POST_RCV_COUNT; ++i) {
What do you think?
More information about the ewg
mailing list