[ofa-general] MVAPICH2 crashes on mixed fabric

Mike Heinz michael.heinz at qlogic.com
Fri Apr 4 13:08:18 PDT 2008


Hey, all, I'm not sure if this is a known bug or some sort of limitation
I'm unaware of, but I've been building and testing with the OFED 1.3 GA
release on a small fabric that has a mix of Arbel-based and newer
Connect-X HCAs.
 
What I've discovered is that MVAPICH and Open MPI work fine across the
entire fabric, but MVAPICH2 crashes when I use a mix of Arbel and
Connect-X HCAs. The errors vary depending on the test program, but
here's an example:
 
[mheinz at compute-0-0 IMB-3.0]$ mpirun -n 5 ./IMB-MPI1
.
.
.
(output snipped)
.
.
.

#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 2
# ( 3 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000         3.51         3.51         3.51         0.00
            1         1000         3.63         3.63         3.63         0.52
            2         1000         3.67         3.67         3.67         1.04
            4         1000         3.64         3.64         3.64         2.09
            8         1000         3.67         3.67         3.67         4.16
           16         1000         3.67         3.67         3.67         8.31
           32         1000         3.74         3.74         3.74        16.32
           64         1000         3.90         3.90         3.90        31.28
          128         1000         4.75         4.75         4.75        51.39
          256         1000         5.21         5.21         5.21        93.79
          512         1000         5.96         5.96         5.96       163.77
         1024         1000         7.88         7.89         7.89       247.54
         2048         1000        11.42        11.42        11.42       342.00
         4096         1000        15.33        15.33        15.33       509.49
         8192         1000        22.19        22.20        22.20       703.83
        16384         1000        34.57        34.57        34.57       903.88
        32768         1000        51.32        51.32        51.32      1217.94
        65536          640        85.80        85.81        85.80      1456.74
       131072          320       155.23       155.24       155.24      1610.40
       262144          160       301.84       301.86       301.85      1656.39
       524288           80       598.62       598.69       598.66      1670.31
      1048576           40      1175.22      1175.30      1175.26      1701.69
      2097152           20      2309.05      2309.05      2309.05      1732.32
      4194304           10      4548.72      4548.98      4548.85      1758.64
[0] Abort: Got FATAL event 3
 at line 796 in file ibv_channel_manager.c
rank 0 in job 1  compute-0-0.local_36049   caused collective abort of
all ranks
  exit status of rank 0: killed by signal 9

If, however, I define my mpdring to contain only Connect-X systems OR
only Arbel systems, IMB-MPI1 runs to completion.
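
In case it's useful for reproducing this, the homogeneous ring is just an
mpd.hosts file restricted to nodes with one HCA generation; a sketch with
illustrative hostnames (my actual node names will differ):

```
# mpd.hosts -- Connect-X nodes only (hostnames are examples)
compute-0-0
compute-0-1
compute-0-2
```

Then boot the ring with "mpdboot -n 3 -f mpd.hosts" before running mpirun.
With a file like this, containing only one HCA type, IMB-MPI1 completes.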
 
Can anyone suggest a workaround, or is this a real bug in MVAPICH2?
 
--
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania
 

