[ofa-general] MVAPICH2 crashes on mixed fabric
Pavel Shamis (Pasha)
pasha at dev.mellanox.co.il
Sun Apr 6 08:04:14 PDT 2008
MVAPICH(1) and OMPI both have HCA auto-detection, and both of them work
well on heterogeneous clusters.
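As a quick sanity check, and assuming the standard OFED ibv_devinfo
utility is installed on the nodes, a loop like the one below (host names
are just placeholders) shows which HCA driver each host reports:

    for h in compute-0-0 compute-0-1; do
        echo -n "$h: "
        ssh $h 'ibv_devinfo | grep hca_id'
    done

Arbel-based HCAs appear under the mthca driver (e.g. "hca_id: mthca0")
and ConnectX HCAs under mlx4 (e.g. "hca_id: mlx4_0"), so mixed output
confirms the fabric really is heterogeneous.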
I'm not sure about mvapich2, but I think the mvapich-discussion list
will be a better place for this kind of question,
so I'm forwarding this mail to the mvapich list.
Pasha.
Mike Heinz wrote:
> Hey, all, I'm not sure if this is a known bug or some sort of
> limitation I'm unaware of, but I've been building and testing with the
> OFED 1.3 GA release on a small fabric that has a mix of Arbel-based
> and newer Connect-X HCAs.
>
> What I've discovered is that mvapich and openmpi work fine across the
> entire fabric, but mvapich2 crashes when I use a mix of Arbels and
> Connect-X. The errors vary depending on the test program but here's an
> example:
>
> [mheinz@compute-0-0 IMB-3.0]$ mpirun -n 5 ./IMB-MPI1
>
> [... output snipped ...]
>
>
> #-----------------------------------------------------------------------------
> # Benchmarking Sendrecv
> # #processes = 2
> # ( 3 additional processes waiting in MPI_Barrier)
> #-----------------------------------------------------------------------------
>    #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
>         0         1000         3.51         3.51         3.51         0.00
>         1         1000         3.63         3.63         3.63         0.52
>         2         1000         3.67         3.67         3.67         1.04
>         4         1000         3.64         3.64         3.64         2.09
>         8         1000         3.67         3.67         3.67         4.16
>        16         1000         3.67         3.67         3.67         8.31
>        32         1000         3.74         3.74         3.74        16.32
>        64         1000         3.90         3.90         3.90        31.28
>       128         1000         4.75         4.75         4.75        51.39
>       256         1000         5.21         5.21         5.21        93.79
>       512         1000         5.96         5.96         5.96       163.77
>      1024         1000         7.88         7.89         7.89       247.54
>      2048         1000        11.42        11.42        11.42       342.00
>      4096         1000        15.33        15.33        15.33       509.49
>      8192         1000        22.19        22.20        22.20       703.83
>     16384         1000        34.57        34.57        34.57       903.88
>     32768         1000        51.32        51.32        51.32      1217.94
>     65536          640        85.80        85.81        85.80      1456.74
>    131072          320       155.23       155.24       155.24      1610.40
>    262144          160       301.84       301.86       301.85      1656.39
>    524288           80       598.62       598.69       598.66      1670.31
>   1048576           40      1175.22      1175.30      1175.26      1701.69
>   2097152           20      2309.05      2309.05      2309.05      1732.32
>   4194304           10      4548.72      4548.98      4548.85      1758.64
> [0] Abort: Got FATAL event 3
> at line 796 in file ibv_channel_manager.c
> rank 0 in job 1 compute-0-0.local_36049 caused collective abort of
> all ranks
> exit status of rank 0: killed by signal 9
> If, however, I define my mpdring to contain only Connect-X systems OR
> only Arbel systems, IMB-MPI1 runs to completion.
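> For example, bringing the ring up from a hosts file that lists only one
> HCA type works fine (the node names below are just illustrative):
>
>     $ cat connectx.hosts
>     compute-0-0
>     compute-0-1
>     $ mpdallexit
>     $ mpdboot -n 2 -f connectx.hosts
>     $ mpirun -n 5 ./IMB-MPI1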
>
> Can anyone suggest a workaround, or is this a real bug with mvapich2?
>
> --
> Michael Heinz
> Principal Engineer, Qlogic Corporation
> King of Prussia, Pennsylvania
>
--
Pavel Shamis (Pasha)
Mellanox Technologies