<html>

<body>

<font size=3>At 09:08 AM 10/1/2008, Ramiro Alba Queipo wrote:<br>

<blockquote type=cite class=cite cite="">Hi all,<br><br>

We have an infiniband cluster of 22 nodes with 20 Gbps Mellanox<br>

MHGS18-XTC cards and I tried to run network performance tests both to
check<br>
the hardware and to clarify concepts.<br><br>

Starting from the theoretical peak according to the infiniband card (in

my<br>

case 4X DDR => 20 Gbits/s => 2.5 Gbytes/s) we have some

limits:<br><br>

1) Bus type: PCIe 8x => 250 Mbytes/lane => 250 * 8 = 2

Gbytes/s<br><br>

2) According to a thread on the Open MPI users mailing list (???):<br><br>

  The 16 Gbit/s number is the theoretical peak, IB is coded 8/10

so<br>

  out of the 20 Gbit/s 16 is what you get. On SDR this number is

<br>

  (of course) 8 Gbit/s achievable (which is ~1000 MB/s) and I've

<br>

  seen well above 900 on MPI (this on 8x PCIe, 2x

margin)   <br>

  <br>

  Is this true?</font></blockquote><br>

IB uses 8b/10b encoding. This results in a 20% overhead on every
frame. Further, the IB protocol (headers, CRC, flow control credits,
etc.) will consume additional bandwidth; the amount will vary with
workload and traffic patterns. Also, any fabric can experience
congestion, which may reduce throughput for any given data flow.

<br><br>
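As a rough worked example of the encoding overhead alone (a sketch, not a measurement; the function name is made up for illustration, and header, CRC, flow control and congestion costs are deliberately ignored), the 8b/10b arithmetic for a 4X link looks like this:<br><br>

```python
def ib_data_rate_gbps(lanes: int, lane_signal_gbps: float) -> float:
    """Data rate left after 8b/10b encoding: 8 data bits per 10 line bits."""
    return lanes * lane_signal_gbps * 8 / 10


def gbps_to_mbytes_per_s(gbps: float) -> float:
    """Convert Gbit/s to MB/s (decimal units, 8 bits per byte)."""
    return gbps * 1000 / 8


# Per-lane signaling rates: SDR = 2.5 Gbit/s, DDR = 5.0 Gbit/s.
sdr_4x = ib_data_rate_gbps(lanes=4, lane_signal_gbps=2.5)
ddr_4x = ib_data_rate_gbps(lanes=4, lane_signal_gbps=5.0)

print(f"4X SDR: {sdr_4x:.1f} Gbit/s = {gbps_to_mbytes_per_s(sdr_4x):.0f} MB/s")
print(f"4X DDR: {ddr_4x:.1f} Gbit/s = {gbps_to_mbytes_per_s(ddr_4x):.0f} MB/s")
```

This is only the ceiling set by the line encoding, so 4X DDR cannot carry more than 16 Gbit/s (2000 MB/s) of data before any protocol overhead is counted.<br><br>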

PCIe uses 8b/10b encoding for both 2.5 GT/s and 5.0 GT/s signaling (the
next generation signaling is scrambling-based, so it provides 2x the data
bandwidth with significantly less encoding overhead). It also has
protocol overheads conceptually similar to IB which will consume
additional bandwidth (keep in mind that many volume chipsets only support a
256B transaction size, so a single IB frame may require 8-16 PCIe
transactions to process). There will also be application /
device driver control messages between the host and the I/O device which
will consume additional bandwidth. <br><br>
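To put numbers on the PCIe side, here is a small sketch under stated assumptions: the helper names are illustrative only, and the 2048-byte and 4096-byte frame sizes are my assumption about which IB MTUs the 8-16 range refers to.<br><br>

```python
import math


def pcie_data_rate_gbps(lanes: int, gt_per_s: float) -> float:
    """Per-direction data rate after 8b/10b encoding (2.5 and 5.0 GT/s only)."""
    return lanes * gt_per_s * 8 / 10


def transactions_per_frame(frame_bytes: int, max_payload_bytes: int = 256) -> int:
    """Number of PCIe transactions needed to move one frame of payload."""
    return math.ceil(frame_bytes / max_payload_bytes)


x8_gen1 = pcie_data_rate_gbps(lanes=8, gt_per_s=2.5)
x8_gen2 = pcie_data_rate_gbps(lanes=8, gt_per_s=5.0)
print(f"x8 at 2.5 GT/s: {x8_gen1:.1f} Gbit/s per direction")
print(f"x8 at 5.0 GT/s: {x8_gen2:.1f} Gbit/s per direction")

# Assumed IB frame payload sizes (hypothetical choice of MTU values).
for frame_bytes in (2048, 4096):
    count = transactions_per_frame(frame_bytes)
    print(f"{frame_bytes}-byte frame: {count} transactions of up to 256B")
```

Note that an x8 slot at 2.5 GT/s tops out at the same 16 Gbit/s as the 4X DDR data rate, so neither side has headroom to absorb the other's protocol overhead.<br><br>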

Also keep in mind that the actual application bandwidth may be further

gated by the memory subsystem, the I/O-to-memory latency, etc. so while

the theoretical bandwidths may be quite high, they will be constrained by

the interactions and the limitations within the overall hardware and

software stacks.  <br><br>

<br>

<blockquote type=cite class=cite cite=""><font size=3>3) According to

other comment in the same thread:<br><br>

  The data throughput limit for 8x PCIe is ~12 Gb/s. The

theoretical<br>

  limit is 16 Gb/s, but each PCIe packet has a whopping 20 byte<br>

  overhead. If the adapter uses 64 byte packets, then you see 1/3

of<br>

  the throughput go to overhead.<br><br>

  Could someone explain that to me?</font></blockquote><br>

DMA Read completions are often returned one cache line at a time while

DMA Writes are often transmitted at the Max_Payload_Size of 256B (some

chipsets do coalesce completions allowing up to the Max_Payload_Size to

be returned).  Depending upon the mix of transactions required to

move an IB frame, the overheads may seem excessive.<br><br>

PCIe overheads vary with the transaction type, the flow control credit

exchanges, CRC, etc.   It is important to keep these in mind

when evaluating the solution.  <br><br>
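The effect of payload size can be sketched with simple arithmetic. The 20-byte per-packet figure below is taken from the quoted comment, not from a specification; the real overhead varies with transaction type, and the function name is only illustrative.<br><br>

```python
def payload_efficiency(payload_bytes: int, overhead_bytes: int = 20) -> float:
    """Fraction of the bytes on the link that are payload."""
    return payload_bytes / (payload_bytes + overhead_bytes)


LINK_DATA_RATE_GBPS = 16.0  # x8 PCIe at 2.5 GT/s after 8b/10b encoding

for payload in (64, 128, 256):
    eff = payload_efficiency(payload)
    usable = LINK_DATA_RATE_GBPS * eff
    print(f"{payload:3d}B payload: {eff:.1%} efficient, ~{usable:.1f} Gbit/s usable")
```

With 64-byte payloads this lands near the ~12 Gb/s quoted above, while 256-byte payloads recover most of the link, which is why the transaction mix matters so much.<br><br>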

<blockquote type=cite class=cite cite=""><font size=3>Then I got another

comment about the matter:<br><br>

The best uni-directional performance I have heard of for PCIe 8x IB<br>

DDR is ~1,400 MB/s (11.2 Gb/s) with Lustre, which is about 55% of

the<br>

theoretical 20 Gb/s advertised speed.<br><br>

<br>

---------------------------------------------------------------------<br>

<br>

<br>

Now, I did some tests (mpi used is OpenMPI) with the following

results:<br><br>

a) Using "Performance tests" from OFED 1.3.1<br>

      <br>

   ib_write_bw -a server ->  1347 MB/s<br><br>

b) Using hpcc (2 cores on different nodes) -> 1157 MB/s (--mca<br>

mpi_leave_pinned 1)<br><br>

c) Using "OSU Micro-Benchmarks" in "MPItests" from

OFED 1.3.1<br><br>

   1) 2 cores from different nodes<br><br>

    - mpirun -np 2 --hostfile pool osu_bibw -> 2001.29

MB/s<br>

(bidirectional)<br>

    - mpirun -np 2 --hostfile pool osu_bw -> 1311.31

MB/s<br><br>

   2) 2 cores from the same node<br><br>

    - mpirun -np 2  osu_bibw -> 2232 MB/s

(bidirectional)<br>

    - mpirun -np 2  osu_bw -> 2058 MB/s<br><br>

The questions are:<br><br>

- Are those results consistent with what they should be?<br>
- Why are the tests with the two cores in the same node better?<br>
- Should not the bidirectional test be a bit higher?<br>
- Why is hpcc so low? </font></blockquote><br>

You would need to provide more information about the system hardware, the
fabrics, etc. for anyone to give a reasoned response. There are many
variables here and, as I noted above, one cannot just derate the hardware
by a fixed percentage and conclude there is a real problem in the
solution stack. It is more complex than that. The question
you should ask is whether the micro-benchmarks you are executing are a
realistic reflection of the real workload. If not, then do any of
these numbers matter at the end of the day, especially if the total time
spent within the interconnect stacks is relatively small or
bursty?<br><br>
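One way to frame that question is an Amdahl's-law style estimate. This is a sketch with made-up numbers: the 5% communication share and the 1.5x interconnect improvement are hypothetical, not taken from any measurement in this thread.<br><br>

```python
def overall_speedup(comm_fraction: float, comm_speedup: float) -> float:
    """Whole-application speedup when only the communication part gets faster."""
    return 1.0 / ((1.0 - comm_fraction) + comm_fraction / comm_speedup)


# Hypothetical workload: 5% of run time in the interconnect stack,
# and an interconnect that somehow becomes 1.5x faster.
gain = overall_speedup(comm_fraction=0.05, comm_speedup=1.5)
print(f"Overall speedup: {gain:.3f}x")
```

If the application spends only a small share of its time communicating, even a large gap between measured and theoretical bandwidth changes the run time by a percent or two.<br><br>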

Mike</body>

</html>