[ewg] OFED-1.3.2-20080728-0355.tgz issues

Doug Ledford dledford at redhat.com
Thu Sep 4 07:38:34 PDT 2008


Hey,

I ran into some problems while working with this for our RHEL5.3
release.

First, in kernel_patches/backports/2.6.18-EL5.1 there are two patches
that go together, they are the ipoib-0400... and ipoib-0410...  Together
they switch ipoib to use dev_set_mtu and rtnl_lock/rtnl_unlock.
Unfortunately, these two patches are terminal.  Evidently the author
never tried them in UD mode, only in RD mode.  I say this because there
is a conditional operation in ipoib_mcast_join_task that is only
executed if you are in UD mode.  That code (combined with the other
patch), in UD mode, calls rtnl_lock(); dev_set_mtu(); rtnl_unlock().
Well, that's all find and dandy any time ipoib_mcast_join_task() is
called from kernel thread context where we aren't already holding the
rtnl_lock().  But, it's highly broken when ipoib_mcast_join_task() is
run as part of flush_workqueue() that's called from user context with
the rtnl_lock() already held.  In any case, it's completely DOA in UD
mode, just try running ifdown ib0, or rebooting the machine if your
openibd init script wants to down ib0.

Next, there are a number of high priority IBM ppc patches that never
made it to the last 1.3.2 tarball.  I don't know if that's because IBM
never submitted them or a tarball wasn't generated after their last
submits, but there were a number missing.

Next, I was watching the compiler warnings as things compiled, and I got
an array subscript out of bounds warning.  Turns out it was a legitimate
error.  There is a thinko in ipoib_transport_dev_init (well, at least in
the file I had here, I assume the final version of the files you have
would end up the same or close to it once quilt has done what I did
manually).  The snippet to fix the bug looks like this:

@@ -228,8 +233,9 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
                priv->rx_wr_draft[i].sg_list = &priv->sglist_draft[i][0];
                if (i < UD_POST_RCV_COUNT - 1)
                        priv->rx_wr_draft[i].next = &priv->rx_wr_draft[i + 1];
+               else
+                       priv->rx_wr_draft[i].next = NULL;
        }
-       priv->rx_wr_draft[i].next = NULL;

        if (ipoib_ud_need_sg(priv->max_ib_mtu)) {
                for (i = 0; i < UD_POST_RCV_COUNT; ++i) {

(The above patch is no doubt mangled by my cut-n-paste, so don't bother
trying to actually use it).

Finally, the printk at the start of srp_event_handler as added by the
async event handler patch is *really* noisy here.  I would suggest
making it KERN_DEBUG instead of KERN_WARNING.

That's all I've got for now.  Although, I'll also point out that at
least in the same general release series, aka 1.3, 1.3.1, 1.3.2, it
would be helpful if you guys didn't rediff any of the patches in
kernel_patches but instead only added new patches.  In cases like this,
where I'm going from 1.3 to 1.3.2-pre, I'm not doing a wholesale
replacement of the code.  I did a three step process:

1) diff -uprN ofa_kernel-1.3{,.2}/{drivers,include} which gets me the
updates to the base code
2) diff -uprN ofa_kernel-1.3{,.2}/kernel_patches/fixes which gets me a
list of changed patches as well as patches that are new to 1.3.2
3) Same as #2 for kernel_patches/backports/2.6.18-EL5.1 (I can't really
use the EL5.2 since it only exists in one place and wouldn't show my
true change from version to version)

There's a little more of grabbing stuff here and there that's needed,
but those are the big points.  Anyway, I found that when a patch that's
applied very early in the patch process changes, the ability to back out
all the later patches, then back out the original version of the changed
patch, then apply the new version of the changed patch, and then reapply
the whole rest of the stack just wasn't feasible.  So, instead, I had
diffed the patches to get a diff of diffs, read those, and hand moved
any relevant changes into the code.  Painful.

-- 
Doug Ledford <dledford at redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 197 bytes
Desc: This is a digitally signed message part
URL: <http://lists.openfabrics.org/pipermail/ewg/attachments/20080904/74662a52/attachment.sig>


More information about the ewg mailing list