[ofa-general] mpi failures on large ia64/ofed/IB clusters

Fri Oct 5 15:36:19 PDT 2007

On "large" IB-connected ia64 clusters, I (and some customers) are
seeing failures in MPI programs. This is commoner the bigger the
cluster nodes are, but I've seen it with as few as 32P/node.

I'm using "Mellanox Technologies MT23108 InfiniHost (rev a1)"
HCAs, with firmware version 3.5.0 (but this has been seen with
several firmware revisions) and OFED-1.2.

For example, with 2-128P systems connected via a single IB port,
using this simple MPI program:

int main(int argc, char **argv)
{
        MPI_Init(&argc, &argv);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
}

and running it with something like:

# mpirun machine1, machine2 128 a.out

I see failures on >1% of runs.

On one run we got this in syslog (ib_mthca's debug_level set to 1):

 15:34:34 ib_mthca 0012:01:00.0: Command 21 completed with status 09
 15:35:34 ib_mthca 0012:01:00.0: HW2SW_MPT failed (-16)
 ....
(status 0x9==MTHCA_CMD_STAT_BAD_RES_STATE => problem with mpi?)

or on another run:

 13:57:15 ib_mthca 0005:01:00.0: Command 1a completed with status 01
 13:57:15 ib_mthca 0005:01:00.0: modify QP 1->2 returnedstatus 01.
 ....
(status 0x1==MTHCA_CMD_STAT_INTERNAL_ERR => ???)

These are just the first debug messages logged (rebooting between
each run), there are lots more, of almost every flavor.

Anyone else seen anything like this? Got any suggestions for debugging?
Should I be looking at MPI, or would you suspect a driver or h/w
problem? Any other info I could provide that'd help to narrow things
down?

Thanks for any pointers.

-- 
Arthur