[ofa-general] mpi failures on large ia64/ofed/IB clusters
Roland Dreier
rdreier at cisco.com
Fri Oct 5 15:46:21 PDT 2007
> On one run we got this in syslog (ib_mthca's debug_level set to 1):
>
> 15:34:34 ib_mthca 0012:01:00.0: Command 21 completed with status 09
> 15:35:34 ib_mthca 0012:01:00.0: HW2SW_MPT failed (-16)
> ....
> (status 0x9==MTHCA_CMD_STAT_BAD_RES_STATE => problem with mpi?)
>
> or on another run:
>
> 13:57:15 ib_mthca 0005:01:00.0: Command 1a completed with status 01
> 13:57:15 ib_mthca 0005:01:00.0: modify QP 1->2 returned status 01.
> ....
> (status 0x1==MTHCA_CMD_STAT_INTERNAL_ERR => ???)
>
> These are just the first debug messages logged (rebooting between
> each run), there are lots more, of almost every flavor.
>
> Anyone else seen anything like this? Got any suggestions for debugging?
> Should I be looking at MPI, or would you suspect a driver or h/w
> problem? Any other info I could provide that'd help to narrow things
> down?
Almost certainly this is a driver and/or firmware bug. MPI and
userspace in general shouldn't be able to do anything that would cause
this type of error.
Given the semi-random nature of the error messages and the fact that
having nodes with lots of CPUs means FW commands are being submitted
in parallel, I have to suspect a race somewhere, possibly in firmware
but possibly in the driver. You could try adding
dev->cmd.max_cmds = 1;
to the beginning of mthca_cmd_use_events() as a hack, and see if you
still see problems.
I don't really see anything racy in the FW command stuff, but it's
possible that there's something like an mmiowb() missing somewhere (I
have a hard time spotting that type of race for some reason).
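For context on the mmiowb() suspicion: on ia64 in particular (which is what this cluster runs), MMIO writes issued by different CPUs under the same spinlock can reach the device out of order unless an mmiowb() is placed before the unlock. The required pattern looks roughly like this (kernel-style sketch with made-up names, not an actual mthca code path):

    spin_lock(&dev->cmd_lock);
    writel(doorbell_val, dev->hcr);   /* MMIO write to post the command */
    mmiowb();                         /* order the MMIO write before the
                                       * next CPU takes the lock (needed
                                       * on ia64) */
    spin_unlock(&dev->cmd_lock);

A missing mmiowb() in a path like this would fit the symptoms: intermittent, load-dependent failures on big ia64 boxes that nobody sees on smaller or x86 systems.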
- R.