Hi all,<br><br>We have been struggling with the performance of a Supermicro<br>(quad-core Xeon) / QLogic (9024-FC) system running Debian, kernel<br>2.6.24 x86_64, and OFED-1.4 (from <a href="http://www.openfabrics.org/">http://www.openfabrics.org/</a>).<br>
There are 8 nodes attached to the switch.<br><br>What happens is that MPI global (collective) communication is<br>extremely slow (roughly a factor of 10 when 16 processes spread over<br>only 2 nodes communicate). This number comes from comparison with a *similar*<br>
system (Dell/Cisco).<br><br>Some tests we have performed:<br><br>* local memory bandwidth test (the "stream" benchmark on an 8-way node<br> returns >8 GB/s)<br><br>* firmware: since the HCAs are on-board Supermicro (board_id:<br>
SM_2001000001; firmware-version: 1.2.0) I don't know how or where to<br> check whether it is current/adequate.<br><br>* OpenIB low-level communication tests look okay (see output from<br> ib_write_lat and ib_write_bw below)<br><br>* However, I see errors of type "RcvSwRelayErrors" when checking<br>
"ibcheckerrors". Is this normal?<br><br>* MPI benchmarks reveal slow all-to-all communication (see output<br> below for the "osu_alltoall" test,<br> <a href="https://mvapich.cse.ohio-state.edu/svn/mpi-benchmarks/branches/OMB-3.1/osu_alltoall.c">https://mvapich.cse.ohio-state.edu/svn/mpi-benchmarks/branches/OMB-3.1/osu_alltoall.c</a>,<br>
compiled with openmpi-1.3 and the Intel 11.0 compiler)<br><br><br>Some questions I have:<br><br>1) Do I have to configure the switch?<br> So far I have not attempted to install the "ofed+" etc. software<br> that came with the QLogic hardware. Is there any chance that it<br>
would be compatible with ofed-1.4? Or even installable under Debian<br> (without too much tweaking)?<br><br>2) Is it okay for this system to run "opensm" on one of the nodes?<br> NOTE: the version is "OpenSM 3.2.5_20081207"<br>
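On the RcvSwRelayErrors point, for anyone else reading: switch ports often increment that counter on multicast traffic, so a nonzero value is not necessarily a fault indicator. A minimal sketch of how one might separate it from genuinely suspicious counters (the counter names and sample output below are invented for illustration, modeled on perfquery-style output, not captured from this fabric):

```python
# Minimal sketch: flag nonzero IB error counters from perfquery-style
# "Name:....value" output. The sample text below is invented.

SAMPLE = """\
SymbolErrors:....................0
LinkDowned:......................0
RcvErrors:.......................0
RcvSwRelayErrors:................1504
XmtDiscards:.....................0
"""

# RcvSwRelayErrors is frequently incremented by multicast packets that
# cannot be relayed back toward the sending port, so treat it separately.
OFTEN_BENIGN = {"RcvSwRelayErrors"}

def suspicious_counters(text):
    """Return {name: value} for nonzero counters not in OFTEN_BENIGN."""
    out = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        name, _, value = line.partition(":")
        value = value.strip(". \t")
        if value.isdigit() and int(value) > 0 and name not in OFTEN_BENIGN:
            out[name] = int(value)
    return out

print(suspicious_counters(SAMPLE))  # -> {} : nothing truly suspicious
```

If counters such as SymbolErrors or RcvErrors were climbing as well, that would point at a cabling or link problem rather than a software/configuration one.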
<br>Any other leads or things I should test?<br><br>Thanks in advance,<br><br>MU<br><br>==============================================================<br>------------------------------------------------------------------<br>
RDMA_Write Latency Test<br>Inline data is used up to 400 bytes message<br>Connection type : RC<br>Mtu : 2048<br>------------------------------------------------------------------<br> #bytes #iterations t_min[usec] t_max[usec] t_typical[usec]<br>
2 1000 3.10 22.88 3.15<br> 4 1000 3.13 6.29 3.16<br> 8 1000 3.14 6.24 3.18<br> 16 1000 3.17 6.25 3.21<br>
32 1000 3.25 7.60 3.38<br> 64 1000 3.32 6.43 3.45<br> 128 1000 3.48 6.40 3.57<br> 256 1000 3.77 6.63 3.82<br>
512 1000 4.71 8.44 4.76<br> 1024 1000 5.58 7.53 5.63<br> 2048 1000 7.38 8.17 7.51<br> 4096 1000 8.64 9.04 8.77<br>
8192 1000 11.41 11.81 11.57<br> 16384 1000 16.55 17.27 16.71<br> 32768 1000 26.81 28.12 27.01<br> 65536 1000 47.41 49.43 47.62<br>
131072 1000 89.86 91.98 90.81<br> 262144 1000 174.25 176.34 175.35<br> 524288 1000 343.03 344.79 343.51<br>1048576 1000 679.04 680.57 679.72<br>
2097152 1000 1350.88 1352.80 1351.75<br>4194304 1000 2693.31 2696.13 2694.50<br>8388608 1000 5380.45 5383.29 5381.62<br>------------------------------------------------------------------<br>
------------------------------------------------------------------<br> RDMA_Write BW Test<br>Number of qp's running 1<br>Connection type : RC<br>Each Qp will post up to 100 messages each time<br>Mtu : 2048<br>
------------------------------------------------------------------<br> #bytes #iterations BW peak[MB/sec] BW average[MB/sec] <br> 2 5000 2.51 2.51<br> 4 5000 5.03 5.03<br>
8 5000 10.09 10.09<br> 16 5000 19.71 19.70<br> 32 5000 39.23 39.22<br> 64 5000 77.91 77.84<br>
128 5000 146.67 146.53<br> 256 5000 223.14 222.82<br> 512 5000 640.09 639.80<br> 1024 5000 1106.72 1106.22<br>
2048 5000 1271.61 1270.87<br> 4096 5000 1379.58 1379.44<br> 8192 5000 1446.01 1445.95<br> 16384 5000 1477.11 1477.09<br>
32768 5000 1498.18 1498.17<br> 65536 5000 1507.23 1507.22<br> 131072 5000 1511.83 1511.82<br> 262144 5000 1487.64 1487.62<br>
524288 5000 1485.76 1485.75<br>1048576 5000 1487.13 1486.54<br>2097152 5000 1487.95 1487.95<br>4194304 5000 1488.11 1488.10<br>
8388608 5000 1488.22 1488.22<br>------------------------------------------------------------------<br>***************OUR-SYSTEM /supermicro-qlogic:********************<br># OSU MPI All-to-All Personalized Exchange Latency Test v3.1.1<br>
# Size Latency (us)<br>1 7.87<br>2 7.80<br>4 7.77<br>8 7.78<br>16 7.81<br>32 9.00<br>
64 9.00<br>128 10.15<br>256 11.75<br>512 15.55<br>1024 23.54<br>2048 40.57<br>4096 107.12<br>
8192 187.28<br>16384 343.61<br>32768 602.17<br>65536 1135.20<br>131072 3086.28<br>262144 9086.50<br>524288 18713.30<br>
1048576 37378.61<br>------------------------------------------------------------------<br>**************REFERENCE_SYSTEM / dell-cisco:***********************<br># OSU MPI All-to-All Personalized Exchange Latency Test v3.1.1<br>
# Size Latency (us)<br>1 16.14<br>2 15.93<br>4 16.25<br>8 16.60<br>16 25.83<br>32 28.66<br>
64 33.57<br>128 40.94<br>256 56.20<br>512 91.24<br>1024 156.13<br>2048 373.17<br>4096 696.95<br>
8192 1464.89<br>16384 1367.96<br>32768 2499.21<br>65536 5686.46<br>131072 11065.98<br>262144 23922.69<br>524288 49294.71<br>
1048576 101290.67<br>==============================================================<br><br>
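P.S. For context on the ib_write_bw plateau of ~1488 MB/sec above, here is a back-of-envelope comparison against a DDR 4x link. The link parameters (4 lanes, 5 Gbit/s signaling, 8b/10b encoding) are assumptions about this fabric, a sketch rather than a statement about the actual hardware:

```python
# Back-of-envelope check of the ib_write_bw plateau against a DDR 4x
# InfiniBand link. Lane count, signaling rate and encoding are assumed.

lanes = 4                 # 4x link (assumption)
gbit_per_lane = 5.0       # DDR signaling rate, Gbit/s per lane (assumption)
encoding = 8.0 / 10.0     # 8b/10b line encoding

data_rate_mbs = lanes * gbit_per_lane * encoding * 1000 / 8  # MB/s (decimal)
measured_mbs = 1488.22    # plateau reported by ib_write_bw above

print(f"theoretical: {data_rate_mbs:.0f} MB/s")
print(f"measured:    {measured_mbs:.0f} MB/s "
      f"({100 * measured_mbs / data_rate_mbs:.0f}% of line rate)")
```

A plateau well below the nominal line rate often points at the host side (e.g. the PCIe link the HCA negotiated, visible in `lspci -vv` as LnkSta) rather than at the fabric itself, so that may be worth checking alongside the switch configuration.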