[ofa-general] MVAPICH2 crashes on mixed fabric

Pavel Shamis (Pasha) pasha at dev.mellanox.co.il
Sun Apr 6 08:04:14 PDT 2008


MVAPICH(1) and OMPI have an HCA auto-detect system, and both of them work 
well on heterogeneous clusters.
I'm not sure about MVAPICH2, but I think the mvapich discussion list will be 
a better place for this kind of question, so I'm forwarding this mail to the 
mvapich list.
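
For reference, the auto-detection in both cases starts from the device list 
that libibverbs exposes on each node. A minimal sketch (illustration only, 
not code taken from either MPI library; the file name is arbitrary) that 
prints what a node reports:

/* list_hcas.c -- sketch only: list the HCAs libibverbs reports on this node.
 * Build with: gcc -o list_hcas list_hcas.c -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int i, num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);

    if (!devs) {
        perror("ibv_get_device_list");
        return 1;
    }
    for (i = 0; i < num; i++)
        printf("HCA %d: %s\n", i, ibv_get_device_name(devs[i]));
    ibv_free_device_list(devs);
    return 0;
}

On the Arbel (InfiniHost III) nodes this should list an mthca device and on 
the ConnectX nodes an mlx4 device, which is exactly the mix in question here.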

Pasha.

Mike Heinz wrote:
> Hey, all, I'm not sure if this is a known bug or some sort of 
> limitation I'm unaware of, but I've been building and testing with the 
> OFED 1.3 GA release on a small fabric that has a mix of Arbel-based 
> and newer Connect-X HCAs.
>  
> What I've discovered is that MVAPICH and Open MPI work fine across the 
> entire fabric, but MVAPICH2 crashes when I use a mix of Arbels and 
> Connect-X HCAs. The errors vary depending on the test program, but here's 
> an example:
>  
> [mheinz@compute-0-0 IMB-3.0]$ mpirun -n 5 ./IMB-MPI1
> .
> .
> .
> (output snipped)
> .
> .
> .
>
> #-----------------------------------------------------------------------------
> # Benchmarking Sendrecv
> # #processes = 2
> # ( 3 additional processes waiting in MPI_Barrier)
> #-----------------------------------------------------------------------------
>        #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
>             0         1000         3.51         3.51         3.51         0.00
>             1         1000         3.63         3.63         3.63         0.52
>             2         1000         3.67         3.67         3.67         1.04
>             4         1000         3.64         3.64         3.64         2.09
>             8         1000         3.67         3.67         3.67         4.16
>            16         1000         3.67         3.67         3.67         8.31
>            32         1000         3.74         3.74         3.74        16.32
>            64         1000         3.90         3.90         3.90        31.28
>           128         1000         4.75         4.75         4.75        51.39
>           256         1000         5.21         5.21         5.21        93.79
>           512         1000         5.96         5.96         5.96       163.77
>          1024         1000         7.88         7.89         7.89       247.54
>          2048         1000        11.42        11.42        11.42       342.00
>          4096         1000        15.33        15.33        15.33       509.49
>          8192         1000        22.19        22.20        22.20       703.83
>         16384         1000        34.57        34.57        34.57       903.88
>         32768         1000        51.32        51.32        51.32      1217.94
>         65536          640        85.80        85.81        85.80      1456.74
>        131072          320       155.23       155.24       155.24      1610.40
>        262144          160       301.84       301.86       301.85      1656.39
>        524288           80       598.62       598.69       598.66      1670.31
>       1048576           40      1175.22      1175.30      1175.26      1701.69
>       2097152           20      2309.05      2309.05      2309.05      1732.32
>       4194304           10      4548.72      4548.98      4548.85      1758.64
> [0] Abort: Got FATAL event 3
>  at line 796 in file ibv_channel_manager.c
> rank 0 in job 1  compute-0-0.local_36049   caused collective abort of 
> all ranks
>   exit status of rank 0: killed by signal 9
> If, however, I define my mpd ring to contain only Connect-X systems OR 
> only Arbel systems, IMB-MPI1 runs to completion.
>  
> Can anyone suggest a workaround, or is this a real bug with MVAPICH2?
>  
> --
> Michael Heinz
> Principal Engineer, Qlogic Corporation
> King of Prussia, Pennsylvania
>  
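
A note on the "Got FATAL event 3" abort: ibv_channel_manager.c appears to be 
where MVAPICH2 handles asynchronous HCA events, and the number printed is 
presumably the raw libibverbs event type (in the ibv_event_type enum, 3 is 
IBV_EVENT_QP_ACCESS_ERR). A minimal sketch of how such events reach an 
application through the standard verbs calls -- generic usage, not MVAPICH2's 
actual code, and the file name is arbitrary:

/* async_watch.c -- sketch only: wait for one asynchronous HCA event and
 * print its numeric type.
 * Build with: gcc -o async_watch async_watch.c -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    struct ibv_context *ctx;
    struct ibv_async_event ev;

    if (!devs || num == 0) {
        fprintf(stderr, "no HCA found\n");
        return 1;
    }
    ctx = ibv_open_device(devs[0]);  /* first HCA on this node */
    if (!ctx) {
        fprintf(stderr, "ibv_open_device failed\n");
        return 1;
    }
    /* Blocks until the HCA reports an asynchronous event
     * (QP/CQ errors, port state changes, ...). */
    if (ibv_get_async_event(ctx, &ev) == 0) {
        printf("async event type %d\n", ev.event_type);
        ibv_ack_async_event(&ev);
    }
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}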


-- 
Pavel Shamis (Pasha)
Mellanox Technologies



