[ofa-general] mpi failures on large ia64/ofed/IB clusters

akepner at sgi.com
Fri Oct 5 17:22:23 PDT 2007


On Fri, Oct 05, 2007 at 03:51:21PM -0700, Roland Dreier wrote:

> Another possibility (independent of the hack I suggested before) would
> be to add an mmiowb() before the mutex_unlock() in mthca_cmd_post().
> 
> I actually have a good feeling about this theory....
> 

Genius!

I have completed over 275 runs with the patch below, so 
we can be very confident that this has fixed things. 

Roland, should I submit a proper patch, or do you want
to take care of this? (And thanks a lot, too!)

diff -rup ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_cmd.c ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_cmd.c
--- ofa_kernel-1.2.orig/drivers/infiniband/hw/mthca/mthca_cmd.c	2007-06-21 07:38:47.000000000 -0700
+++ ofa_kernel-1.2/drivers/infiniband/hw/mthca/mthca_cmd.c	2007-10-05 16:04:38.926857822 -0700
@@ -288,7 +288,7 @@ static int mthca_cmd_post(struct mthca_d
 	else
 		err = mthca_cmd_post_hcr(dev, in_param, out_param, in_modifier,
 					 op_modifier, op, token, event);
-
+	mmiowb();
 	mutex_unlock(&dev->cmd.hcr_mutex);
 	return err;
 }


-- 
Arthur



