[ofa-general] Infiniband bandwidth
Ramiro Alba Queipo
raq at cttc.upc.edu
Wed Oct 1 09:08:46 PDT 2008
Hi all,
We have an InfiniBand cluster of 22 nodes with 20 Gbps Mellanox
MHGS18-XTC cards, and I ran some network performance tests, both to check
the hardware and to clarify concepts.
Starting from the theoretical peak of the InfiniBand card (in my
case 4X DDR => 20 Gbit/s => 2.5 Gbytes/s), we have some limits:
1) Bus type: PCIe x8 => 250 Mbytes/s per lane => 250 * 8 = 2 Gbytes/s
2) According to a thread on the Open MPI users mailing list (???):
The 16 Gbit/s number is the theoretical peak: IB is coded 8b/10b, so
out of the 20 Gbit/s, 16 is what you get. On SDR this number is
(of course) 8 Gbit/s (~1000 MB/s), which is achievable; I've seen
well above 900 MB/s on MPI (this on x8 PCIe, 2x margin).
Is this true?
3) According to another comment in the same thread:
The data throughput limit for 8x PCIe is ~12 Gb/s. The theoretical
limit is 16 Gb/s, but each PCIe packet has a whopping 20 byte
overhead. If the adapter uses 64 byte packets, then you see 1/3 of
the throughput go to overhead.
Could someone explain that to me?
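To check my own understanding of points 1) to 3), here is the arithmetic
as I read it, as a short Python sketch; the split of 64-byte payloads plus
~20 bytes of packet overhead is only my reading of the quote above, not
something I have measured:

    # Back-of-the-envelope limits, as I understand them
    signal_rate_gbps = 20.0                        # 4X DDR signalling rate
    data_rate_gbps = signal_rate_gbps * 8 / 10     # 8b/10b encoding -> 16 Gbit/s
    print(data_rate_gbps / 8)                      # -> 2.0 GBytes/s of data on the wire

    pcie_lane_mbs = 250.0                          # PCIe 1.x, per lane, per direction
    pcie_raw_mbs = pcie_lane_mbs * 8               # x8 slot -> 2000 MBytes/s (16 Gbit/s)
    # assuming each 64-byte payload carries ~20 bytes of packet overhead:
    pcie_eff_mbs = pcie_raw_mbs * 64 / (64 + 20)   # ~1524 MBytes/s (~12.2 Gbit/s)
    print(pcie_eff_mbs)

That would roughly explain the "~12 Gb/s" figure, if my assumption about
the packet sizes is right.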
Then I got another comment about the matter:
The best uni-directional performance I have heard of for PCIe 8x IB
DDR is ~1,400 MB/s (11.2 Gb/s) with Lustre, which is about 55% of the
theoretical 20 Gb/s advertised speed.
---------------------------------------------------------------------
Now, I did some tests (the MPI used is Open MPI) with the following results:
a) Using "Performance tests" from OFED 1.3.1
ib_write_bw -a server -> 1347 MB/s
b) Using hpcc (2 cores on different nodes) -> 1157 MB/s (--mca
mpi_leave_pinned 1)
c) Using "OSU Micro-Benchmarks" in "MPItests" from OFED 1.3.1
1) 2 cores from different nodes
- mpirun -np 2 --hostfile pool osu_bibw -> 2001.29 MB/s
(bidirectional)
- mpirun -np 2 --hostfile pool osu_bw -> 1311.31 MB/s
2) 2 cores from the same node
- mpirun -np 2 osu_bibw -> 2232 MB/s (bidirectional)
- mpirun -np 2 osu_bw -> 2058 MB/s
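To put my unidirectional numbers in context, this is how I would compute
their efficiency against the ceilings from the first sketch (the 2000 and
~1524 MB/s values are my own assumptions from there):

    # measured unidirectional bandwidths (MB/s) vs. the assumed PCIe ceilings
    results = {"ib_write_bw": 1347, "hpcc": 1157, "osu_bw": 1311.31}
    pcie_raw_mbs = 2000.0    # PCIe x8 theoretical
    pcie_eff_mbs = 1524.0    # with the assumed packet overhead
    for name, bw in results.items():
        print(name, round(bw / pcie_raw_mbs, 2), round(bw / pcie_eff_mbs, 2))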
The questions are:
- Are those results consistent with what they should be?
- Why are the tests with the two cores on the same node better?
- Shouldn't the bidirectional test be a bit higher?
- Why is the hpcc result so low?
Thanks in advance
Regards