[ofa-general] mpi failures on large ia64/ofed/IB clusters
Roland Dreier
rdreier at cisco.com
Fri Oct 5 15:46:21 PDT 2007
> On one run we got this in syslog (ib_mthca's debug_level set to 1):
>
> 15:34:34 ib_mthca 0012:01:00.0: Command 21 completed with status 09
> 15:35:34 ib_mthca 0012:01:00.0: HW2SW_MPT failed (-16)
> ....
> (status 0x9==MTHCA_CMD_STAT_BAD_RES_STATE => problem with mpi?)
>
> or on another run:
>
> 13:57:15 ib_mthca 0005:01:00.0: Command 1a completed with status 01
> 13:57:15 ib_mthca 0005:01:00.0: modify QP 1->2 returned status 01.
> ....
> (status 0x1==MTHCA_CMD_STAT_INTERNAL_ERR => ???)
>
> These are just the first debug messages logged (rebooting between
> each run), there are lots more, of almost every flavor.
>
> Anyone else seen anything like this? Got any suggestions for debugging?
> Should I be looking at MPI, or would you suspect a driver or h/w
> problem? Any other info I could provide that'd help to narrow things
> down?
Almost certainly this is a driver and/or firmware bug. MPI and
userspace in general shouldn't be able to do anything that would cause
this type of error.
Given the semi-random nature of the error messages and the fact that
having nodes with lots of CPUs means FW commands are being submitted
in parallel, I have to suspect a race somewhere, possibly in firmware
but possibly in the driver. You could try adding
dev->cmd.max_cmds = 1;
to the beginning of mthca_cmd_use_events() as a hack, and see if you
still see problems.
I don't really see anything racy in the FW command stuff, but it's
possible that there's something like an mmiowb() missing somewhere (I
have a hard time spotting that type of race for some reason).
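For context on the mmiowb() suspicion: on ia64 in particular (which is what this cluster runs), MMIO writes issued by different CPUs under the same spinlock can reach the device out of order unless an mmiowb() is placed before the unlock. The required pattern looks roughly like this (kernel-style sketch with made-up names, not an actual mthca code path):

    spin_lock(&dev->cmd_lock);
    writel(doorbell_val, dev->hcr);   /* MMIO write to post the command */
    mmiowb();                         /* order the MMIO write before the
                                       * next CPU takes the lock (needed
                                       * on ia64) */
    spin_unlock(&dev->cmd_lock);

A missing mmiowb() in a path like this would fit the symptoms: intermittent, load-dependent failures on big ia64 boxes that nobody sees on smaller or x86 systems.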
- R.