Hi all,

we have been struggling with the performance of a Supermicro
(quad-core Xeon) / QLogic (9024-FC) system running Debian, kernel
2.6.24-x86_64, and OFED-1.4 (from http://www.openfabrics.org/).
There are 8 nodes attached to the switch.

What happens is that the performance of MPI collective ("global")
communication is extremely low: roughly a factor of 10 slower when 16
processes spread over only 2 nodes communicate. This number comes from
a comparison with a *similar* system (Dell/Cisco).

Some tests we have performed:

* local memory bandwidth test ("stream" benchmark on an 8-way node
  returns >8 GB/s; see the small triad sketch after this list)

* firmware: since the HCAs are on-board Supermicro (board_id:
  SM_2001000001; firmware-version: 1.2.0) I don't know how/where to
  check adequacy.

* openib low-level communication tests seem okay (see output from
  ib_write_lat, ib_write_bw below)

* However, I see errors of type "RcvSwRelayErrors" when checking
  "ibcheckerrors". Is this normal?

* MPI benchmarks reveal slow all-to-all communication (see output
  below for the "osu_alltoall" test,
  https://mvapich.cse.ohio-state.edu/svn/mpi-benchmarks/branches/OMB-3.1/osu_alltoall.c,
  compiled with openmpi-1.3 and the Intel compiler 11.0; a stripped-down
  timing loop of the same kind follows this list as well)
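For reference, the memory bandwidth number above is measured along these
lines; this is only a minimal triad-style sketch (NOT the official STREAM
code), and the array size and repeat count here are just picked for
illustration:

/* minimal STREAM-style "triad" sketch (not the official STREAM code):
   a[i] = b[i] + scalar*c[i], counted as 24 bytes of traffic per element */
#include <stdio.h>
#include <omp.h>

#define N      20000000          /* 3 x 160 MB, well beyond the caches */
#define NTRIES 10

static double a[N], b[N], c[N];

int main(void)
{
    const double scalar = 3.0;
    double best = 0.0;

    /* initialize in parallel so pages are first-touched by all threads */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    for (int k = 0; k < NTRIES; k++) {
        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = b[i] + scalar * c[i];
        double gbs = 24.0 * N / (omp_get_wtime() - t0) / 1.0e9;
        if (gbs > best) best = gbs;      /* report the best of NTRIES runs */
    }
    printf("best triad bandwidth: %.2f GB/s\n", best);
    return 0;
}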
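And this is the kind of stripped-down all-to-all timing loop I use as a
sanity check next to osu_alltoall (again only a sketch, not the OSU code;
the 1024-byte message size and 100 iterations are arbitrary choices):

/* minimal MPI_Alltoall timing sketch: average time per call for one
   message size per rank pair */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int msgsize = 1024;     /* bytes sent to each peer */
    const int iters   = 100;
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *sendbuf = malloc((size_t)msgsize * nprocs);
    char *recvbuf = malloc((size_t)msgsize * nprocs);

    /* warm-up call so connection setup is not included in the timing */
    MPI_Alltoall(sendbuf, msgsize, MPI_CHAR,
                 recvbuf, msgsize, MPI_CHAR, MPI_COMM_WORLD);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Alltoall(sendbuf, msgsize, MPI_CHAR,
                     recvbuf, msgsize, MPI_CHAR, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("%d procs, %d bytes/peer: %.2f us per alltoall\n",
               nprocs, msgsize, (t1 - t0) * 1.0e6 / iters);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

(compiled with the openmpi-1.3 mpicc and run with 16 processes spread
over two nodes, same as the osu_alltoall runs below)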

Some questions I have:

1) Do I have to configure the switch?
   So far I have not attempted to install the "ofed+" etc. software
   that came with the QLogic hardware. Is there any chance that it
   would be compatible with OFED-1.4? Or even installable under Debian
   (without too much tweaking)?

2) Is it okay for this system to run "opensm" on one of the nodes?
   NOTE: the version is "OpenSM 3.2.5_20081207"

Any other leads or things I should test?

Thanks in advance,

MU

==============================================================
------------------------------------------------------------------
                    RDMA_Write Latency Test
Inline data is used up to 400 bytes message
Connection type : RC
Mtu : 2048
------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]
      2        1000           3.10          22.88             3.15
      4        1000           3.13           6.29             3.16
      8        1000           3.14           6.24             3.18
     16        1000           3.17           6.25             3.21
     32        1000           3.25           7.60             3.38
     64        1000           3.32           6.43             3.45
    128        1000           3.48           6.40             3.57
    256        1000           3.77           6.63             3.82
    512        1000           4.71           8.44             4.76
   1024        1000           5.58           7.53             5.63
   2048        1000           7.38           8.17             7.51
   4096        1000           8.64           9.04             8.77
   8192        1000          11.41          11.81            11.57
  16384        1000          16.55          17.27            16.71
  32768        1000          26.81          28.12            27.01
  65536        1000          47.41          49.43            47.62
 131072        1000          89.86          91.98            90.81
 262144        1000         174.25         176.34           175.35
 524288        1000         343.03         344.79           343.51
1048576        1000         679.04         680.57           679.72
2097152        1000        1350.88        1352.80          1351.75
4194304        1000        2693.31        2696.13          2694.50
8388608        1000        5380.45        5383.29          5381.62
------------------------------------------------------------------
------------------------------------------------------------------
                    RDMA_Write BW Test
Number of qp's running 1
Connection type : RC
Each Qp will post up to 100 messages each time
Mtu : 2048
------------------------------------------------------------------
 #bytes #iterations    BW peak[MB/sec]    BW average[MB/sec]
      2        5000               2.51                  2.51
      4        5000               5.03                  5.03
      8        5000              10.09                 10.09
     16        5000              19.71                 19.70
     32        5000              39.23                 39.22
     64        5000              77.91                 77.84
    128        5000             146.67                146.53
    256        5000             223.14                222.82
    512        5000             640.09                639.80
   1024        5000            1106.72               1106.22
   2048        5000            1271.61               1270.87
   4096        5000            1379.58               1379.44
   8192        5000            1446.01               1445.95
  16384        5000            1477.11               1477.09
  32768        5000            1498.18               1498.17
  65536        5000            1507.23               1507.22
 131072        5000            1511.83               1511.82
 262144        5000            1487.64               1487.62
 524288        5000            1485.76               1485.75
1048576        5000            1487.13               1486.54
2097152        5000            1487.95               1487.95
4194304        5000            1488.11               1488.10
8388608        5000            1488.22               1488.22
------------------------------------------------------------------
*************** OUR SYSTEM / Supermicro-QLogic: ******************
# OSU MPI All-to-All Personalized Exchange Latency Test v3.1.1
# Size            Latency (us)
1                         7.87
2                         7.80
4                         7.77
8                         7.78
16                        7.81
32                        9.00
64                        9.00
128                      10.15
256                      11.75
512                      15.55
1024                     23.54
2048                     40.57
4096                    107.12
8192                    187.28
16384                   343.61
32768                   602.17
65536                  1135.20
131072                 3086.28
262144                 9086.50
524288                18713.30
1048576               37378.61
------------------------------------------------------------------
************** REFERENCE SYSTEM / Dell-Cisco: ********************
# OSU MPI All-to-All Personalized Exchange Latency Test v3.1.1
# Size            Latency (us)
1                        16.14
2                        15.93
4                        16.25
8                        16.60
16                       25.83
32                       28.66
64                       33.57
128                      40.94
256                      56.20
512                      91.24
1024                    156.13
2048                    373.17
4096                    696.95
8192                   1464.89
16384                  1367.96
32768                  2499.21
65536                  5686.46
131072                11065.98
262144                23922.69
524288                49294.71
1048576              101290.67
==============================================================