[ofa-general] mpi failures on large ia64/ofed/IB clusters
akepner at sgi.com
Fri Oct 5 17:22:23 PDT 2007
On Fri, Oct 05, 2007 at 03:51:21PM -0700, Roland Dreier wrote:
> Another possibility (independent of the hack I suggested before) would
> be to add an mmiowb() before the mutex_unlock() in mthca_cmd_post().
>
> I actually have a good feeling about this theory....
>
Genius!
I have completed over 275 runs with the patch below, so
we can be very confident that this has fixed things.
Roland, should I submit a proper patch, or do you want
to take care of this? (And thanks a lot, too!)
diff -rup ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_cmd.c ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_cmd.c
--- ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_cmd.c 2007-06-21 07:38:47.000000000 -0700
+++ ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_cmd.c 2007-10-05 16:04:38.926857822 -0700
@@ -288,7 +288,7 @@ static int mthca_cmd_post(struct mthca_d
 	else
 		err = mthca_cmd_post_hcr(dev, in_param, out_param, in_modifier,
 					 op_modifier, op, token, event);
-
+	mmiowb();
 	mutex_unlock(&dev->cmd.hcr_mutex);
 	return err;
 }
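
For anyone not familiar with the barrier, here is a minimal userspace sketch of the pattern the patch relies on: MMIO writes made while holding a lock are followed by mmiowb() before the lock is dropped, so that writes from the next lock holder on another CPU cannot reach the device first. This is not driver code. The names fake_hcr, hcr_lock, posts, mmiowb_stub() and post_command() are hypothetical stand-ins, and the stub is only a compiler barrier, not the real ia64 mmiowb().

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's mmiowb(). In userspace this is
 * only a compiler barrier; it does NOT order real MMIO writes. */
#define mmiowb_stub() __asm__ __volatile__("" ::: "memory")

/* Hypothetical stand-ins for the command register and its mutex. */
static pthread_mutex_t hcr_lock = PTHREAD_MUTEX_INITIALIZER;
static volatile uint32_t fake_hcr[2];
static uint32_t posts;

/* Post one "command": write the registers under the lock, then issue the
 * barrier BEFORE unlocking, mirroring the placement in the patch. */
static int post_command(uint32_t param, uint32_t op)
{
	pthread_mutex_lock(&hcr_lock);

	fake_hcr[0] = param;	/* stands in for the MMIO writes */
	fake_hcr[1] = op;
	posts++;

	mmiowb_stub();		/* barrier goes here, before the unlock */
	pthread_mutex_unlock(&hcr_lock);
	return 0;
}
```

The point of the sketch is only the ordering of the three steps (write, barrier, unlock); without the barrier the lock alone serializes the CPUs but, on platforms like ours, not necessarily the arrival order of their writes at the device.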
--
Arthur