[ofa-general] MVAPICH2 crashes on mixed fabric

wei huang huanwei at cse.ohio-state.edu
Wed Apr 9 08:18:19 PDT 2008


Hi Mike,

Are the Arbel-based cards DDR? If so, try putting:

-env MV2_DEFAULT_MTU IBV_MTU_2048

in addition to the environment variables you are already using. Thanks.
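Putting the three settings together with the machinefile and binary from your earlier run, the full command would look something like this (a sketch only; paths and host counts are taken from your setup and may need adjusting):

```shell
# Workaround for mixed Arbel / Connect-X fabrics: disable message
# coalescing, shrink the vbuf size, and cap the MTU at 2048 so both
# HCA generations negotiate compatible parameters.
/usr/mpi/pgi/mvapich2-1.0.2/bin/mpiexec -1 \
    -machinefile /home/mheinz/mvapich2-pgi/mpi_hosts -n 4 \
    -env MV2_USE_COALESCE 0 \
    -env MV2_VBUF_TOTAL_SIZE 9216 \
    -env MV2_DEFAULT_MTU IBV_MTU_2048 \
    PMB2.2.1/SRC_PMB/PMB-MPI1
```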

Regards,
Wei Huang

774 Dreese Lab, 2015 Neil Ave,
Dept. of Computer Science and Engineering
Ohio State University
OH 43210
Tel: (614)292-8501


On Tue, 8 Apr 2008, Mike Heinz wrote:

> Wei,
>
> No joy. The following command:
>
> + /usr/mpi/pgi/mvapich2-1.0.2/bin/mpiexec -1 -machinefile
> /home/mheinz/mvapich2-pgi/mpi_hosts -n 4 -env MV2_USE_COALESCE 0 -env
> MV2_VBUF_TOTAL_SIZE 9216 PMB2.2.1/SRC_PMB/PMB-MPI1
>
> Produced the following error:
>
> [0] Abort: Got FATAL event 3
>  at line 796 in file ibv_channel_manager.c
> rank 0 in job 48  compute-0-3.local_33082   caused collective abort of
> all ranks
>   exit status of rank 0: killed by signal 9
> + set +x
>
> Note that compute-0-3 has a Connect-X HCA.
>
> If I restrict the ring to only nodes with connect-x the problem does not
> occur.
>
> This isn't a huge problem for me; this 4-node cluster is actually for
> testing the creation of Rocks Rolls and I can simply record it as a
> known limitation when using mvapich2 - but it could impact users in the
> field if a cluster gets extended with newer HCAs.
>
>
> --
> Michael Heinz
> Principal Engineer, Qlogic Corporation
> King of Prussia, Pennsylvania
>
> -----Original Message-----
> From: wei huang [mailto:huanwei at cse.ohio-state.edu]
> Sent: Sunday, April 06, 2008 8:58 PM
> To: Mike Heinz
> Cc: general at lists.openfabrics.org
> Subject: Re: [ofa-general] MVAPICH2 crashes on mixed fabric
>
> Hi Mike,
>
> Currently mvapich2 detects the different HCA types and selects
> different communication parameters for each, which may cause the
> problem. We are working on this feature and it will be available in our
> next release.
> For now, if you want to run on this setup, please set a few environment
> variables, like:
>
> mpiexec -n 2 -env MV2_USE_COALESCE 0 -env MV2_VBUF_TOTAL_SIZE 9216
> ./a.out
>
> Please let us know if this works. Thanks.
>
> Regards,
> Wei Huang
>
> 774 Dreese Lab, 2015 Neil Ave,
> Dept. of Computer Science and Engineering
> Ohio State University
> OH 43210
> Tel: (614)292-8501
>
>
> On Fri, 4 Apr 2008, Mike Heinz wrote:
>
> > Hey, all, I'm not sure if this is a known bug or some sort of
> > limitation I'm unaware of, but I've been building and testing with the
> > OFED 1.3 GA release on a small fabric that has a mix of Arbel-based
> > and newer Connect-X HCAs.
> >
> > What I've discovered is that mvapich and openmpi work fine across the
> > entire fabric, but mvapich2 crashes when I use a mix of Arbels and
> > Connect-X. The errors vary depending on the test program but here's an
> > example:
> >
> > [mheinz at compute-0-0 IMB-3.0]$ mpirun -n 5 ./IMB-MPI1 .
> > .
> > .
> > (output snipped)
> > .
> > .
> > .
> >
> > #-----------------------------------------------------------------------------
> > # Benchmarking Sendrecv
> > # #processes = 2
> > # ( 3 additional processes waiting in MPI_Barrier)
> > #-----------------------------------------------------------------------------
> >        #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
> >             0         1000         3.51         3.51         3.51         0.00
> >             1         1000         3.63         3.63         3.63         0.52
> >             2         1000         3.67         3.67         3.67         1.04
> >             4         1000         3.64         3.64         3.64         2.09
> >             8         1000         3.67         3.67         3.67         4.16
> >            16         1000         3.67         3.67         3.67         8.31
> >            32         1000         3.74         3.74         3.74        16.32
> >            64         1000         3.90         3.90         3.90        31.28
> >           128         1000         4.75         4.75         4.75        51.39
> >           256         1000         5.21         5.21         5.21        93.79
> >           512         1000         5.96         5.96         5.96       163.77
> >          1024         1000         7.88         7.89         7.89       247.54
> >          2048         1000        11.42        11.42        11.42       342.00
> >          4096         1000        15.33        15.33        15.33       509.49
> >          8192         1000        22.19        22.20        22.20       703.83
> >         16384         1000        34.57        34.57        34.57       903.88
> >         32768         1000        51.32        51.32        51.32      1217.94
> >         65536          640        85.80        85.81        85.80      1456.74
> >        131072          320       155.23       155.24       155.24      1610.40
> >        262144          160       301.84       301.86       301.85      1656.39
> >        524288           80       598.62       598.69       598.66      1670.31
> >       1048576           40      1175.22      1175.30      1175.26      1701.69
> >       2097152           20      2309.05      2309.05      2309.05      1732.32
> >       4194304           10      4548.72      4548.98      4548.85      1758.64
> > [0] Abort: Got FATAL event 3
> >  at line 796 in file ibv_channel_manager.c
> > rank 0 in job 1  compute-0-0.local_36049   caused collective abort of
> > all ranks
> >   exit status of rank 0: killed by signal 9
> >
> > If, however, I define my mpdring to contain only Connect-X systems OR
> > only Arbel systems, IMB-MPI1 runs to completion.
> >
> > Can anyone suggest a workaround, or is this a real bug in mvapich2?
> >
> > --
> > Michael Heinz
> > Principal Engineer, Qlogic Corporation
> > King of Prussia, Pennsylvania
> >
> >
>
>
>



