[ofa-general] MVAPICH2 crashes on mixed fabric
Mike Heinz
michael.heinz at qlogic.com
Tue Apr 8 07:27:55 PDT 2008
Wei,
No joy. The following command:
+ /usr/mpi/pgi/mvapich2-1.0.2/bin/mpiexec -1 -machinefile
/home/mheinz/mvapich2-pgi/mpi_hosts -n 4 -env MV2_USE_COALESCE 0 -env
MV2_VBUF_TOTAL_SIZE 9216 PMB2.2.1/SRC_PMB/PMB-MPI1
Produced the following error:
[0] Abort: Got FATAL event 3
at line 796 in file ibv_channel_manager.c
rank 0 in job 48 compute-0-3.local_33082 caused collective abort of
all ranks
exit status of rank 0: killed by signal 9
+ set +x
Note that compute-0-3 has a Connect-X HCA.
If I restrict the ring to only nodes with Connect-X HCAs, the problem
does not occur.
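For anyone trying to reproduce this, which nodes carry Arbel versus
Connect-X HCAs can be checked with ibv_devinfo from libibverbs (Arbel
shows up as mthca*, Connect-X as mlx4_*). A minimal sketch; the host
names and the helper function are my own, and on a real cluster you
would substitute the hosts from your machinefile:

```shell
# Build the remote command used to identify the HCA model on one host.
# (Helper name and host list are hypothetical; ibv_devinfo itself is
# the standard libibverbs query tool.)
hca_check_cmd() {
    echo "ssh $1 \"ibv_devinfo | grep hca_id\""
}

# Print the check for each node; run the printed commands on a live
# fabric to see mthca* (Arbel) vs mlx4_* (Connect-X) per host.
for h in compute-0-0 compute-0-1 compute-0-2 compute-0-3; do
    hca_check_cmd "$h"
done
```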
This isn't a huge problem for me; this 4-node cluster is actually for
testing the creation of Rocks Rolls, and I can simply record it as a
known limitation when using mvapich2 - but it could impact users in the
field if a cluster gets extended with newer HCAs.
--
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania
-----Original Message-----
From: wei huang [mailto:huanwei at cse.ohio-state.edu]
Sent: Sunday, April 06, 2008 8:58 PM
To: Mike Heinz
Cc: general at lists.openfabrics.org
Subject: Re: [ofa-general] MVAPICH2 crashes on mixed fabric
Hi Mike,
Currently mvapich2 detects each HCA type and selects different
communication parameters for it, which can cause this problem on a
mixed fabric. We are working on this feature, and it will be available
in our next release.
For now, if you want to run on this setup, please set a few environment
variables, like:
mpiexec -n 2 -env MV2_USE_COALESCE 0 -env MV2_VBUF_TOTAL_SIZE 9216
./a.out
Please let us know if this works. Thanks.
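To apply the suggested workaround consistently, the two settings can be
folded into a small helper so every launch on the mixed fabric gets
them. A minimal sketch; the function name is mine, and the rank count
and program are placeholders (a machinefile would be added the same
way):

```shell
# Build the mpiexec argument string carrying the mixed-fabric
# workaround: disable message coalescing and force a common vbuf
# size on every rank, per the settings above.
mixed_fabric_args() {
    # $1 = number of ranks, $2 = program to run
    echo "-n $1 -env MV2_USE_COALESCE 0 -env MV2_VBUF_TOTAL_SIZE 9216 $2"
}

# On a real cluster:
#   mpiexec $(mixed_fabric_args 4 ./IMB-MPI1)
mixed_fabric_args 4 ./IMB-MPI1
```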
Regards,
Wei Huang
774 Dreese Lab, 2015 Neil Ave,
Dept. of Computer Science and Engineering, Ohio State University, OH 43210
Tel: (614)292-8501
On Fri, 4 Apr 2008, Mike Heinz wrote:
> Hey, all, I'm not sure if this is a known bug or some sort of
> limitation I'm unaware of, but I've been building and testing with the
> OFED 1.3 GA release on a small fabric that has a mix of Arbel-based
> and newer Connect-X HCAs.
>
> What I've discovered is that mvapich and openmpi work fine across the
> entire fabric, but mvapich2 crashes when I use a mix of Arbels and
> Connect-X. The errors vary depending on the test program but here's an
> example:
>
> [mheinz at compute-0-0 IMB-3.0]$ mpirun -n 5 ./IMB-MPI1 .
> .
> .
> (output snipped)
> .
> .
> .
>
> #-----------------------------------------------------------------------------
> # Benchmarking Sendrecv
> # #processes = 2
> # ( 3 additional processes waiting in MPI_Barrier)
> #-----------------------------------------------------------------------------
>       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]  Mbytes/sec
>            0         1000         3.51         3.51         3.51        0.00
>            1         1000         3.63         3.63         3.63        0.52
>            2         1000         3.67         3.67         3.67        1.04
>            4         1000         3.64         3.64         3.64        2.09
>            8         1000         3.67         3.67         3.67        4.16
>           16         1000         3.67         3.67         3.67        8.31
>           32         1000         3.74         3.74         3.74       16.32
>           64         1000         3.90         3.90         3.90       31.28
>          128         1000         4.75         4.75         4.75       51.39
>          256         1000         5.21         5.21         5.21       93.79
>          512         1000         5.96         5.96         5.96      163.77
>         1024         1000         7.88         7.89         7.89      247.54
>         2048         1000        11.42        11.42        11.42      342.00
>         4096         1000        15.33        15.33        15.33      509.49
>         8192         1000        22.19        22.20        22.20      703.83
>        16384         1000        34.57        34.57        34.57      903.88
>        32768         1000        51.32        51.32        51.32     1217.94
>        65536          640        85.80        85.81        85.80     1456.74
>       131072          320       155.23       155.24       155.24     1610.40
>       262144          160       301.84       301.86       301.85     1656.39
>       524288           80       598.62       598.69       598.66     1670.31
>      1048576           40      1175.22      1175.30      1175.26     1701.69
>      2097152           20      2309.05      2309.05      2309.05     1732.32
>      4194304           10      4548.72      4548.98      4548.85     1758.64
> [0] Abort: Got FATAL event 3
> at line 796 in file ibv_channel_manager.c
> rank 0 in job 1 compute-0-0.local_36049 caused collective abort of
> all ranks
> exit status of rank 0: killed by signal 9
>
> If, however, I define my mpdring to contain only Connect-X systems OR
> only Arbel systems, IMB-MPI1 runs to completion.
>
> Can anyone suggest a workaround, or is this a real bug with mvapich2?
>
> --
> Michael Heinz
> Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania
>
>