<html>

<body>

<br>

Off-line someone asked me to clarify my earlier e-mail.  Given this

discussion continues, perhaps this might help explain the performance a

bit more.  The Max Payload Size quoted here is what is typically

implemented on x86 chipsets though other chipsets may use a larger

value.  From a pure bandwidth perspective (which is not typical of

many applications), this should be reasonably accurate.  In any

case, this is just an FYI.<br><br>

<br>

A x4 IB link at 5 GT/s is 20 Gbps raw (customers do comprehend that the marketing hype

does not translate into that bandwidth being available for applications;

I have had to explain to the press in the past that raw bandwidth does not

equal application-available bandwidth).   Take off 8b/10b,

protocol overheads, etc. and assuming a 2KB PMTU, then one can expect to

hit perhaps 14-15 Gbps per direction depending upon the

workload.   Let's assume an aggregate of 30 Gbps of potential

application bandwidth for simplicity.   The PCIe x8 2.5 GT/s link is also

20 Gbps raw, so take off the 8b/10b, protocol overheads, control /

application overheads, etc.  It uses at most a 256B Max Payload

Size on DMA Writes and cache-line-sized (64B) DMA Read Completions, though

many people use PIO Writes to avoid DMA Reads when it comes to

micro-benchmarks.  As a result, the actual performance is unlikely to hit what IB might

drive, depending upon the direction and mix of control and application

data transactions.  Add in the impact on the memory controller, which in

real-world applications is servicing the processors quite a bit more than

illustrated by micro-benchmarks, and the ability of a system to drive an

IB x4 DDR device at link rate is very questionable.  <br><br>
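A rough sketch of the arithmetic above, in Python.  The per-packet overhead byte counts and the function names are assumptions for illustration, not figures from this thread, and the model ignores acknowledgements, flow control, control traffic, and memory controller contention, so it gives upper bounds rather than expected results.<br><br>

```python
def encoded_gbps(lanes, gt_per_s):
    """Per-direction link rate after 8b/10b encoding (8 data bits per 10 line bits)."""
    return lanes * gt_per_s * 8 / 10


def effective_gbps(lanes, gt_per_s, payload_bytes, overhead_bytes):
    """Per-direction payload bandwidth when every packet carries a fixed overhead."""
    efficiency = payload_bytes / (payload_bytes + overhead_bytes)
    return encoded_gbps(lanes, gt_per_s) * efficiency


# Assumed per-packet overheads (illustrative only, not from the original message).
IB_OVERHEAD_BYTES = 26    # assumed: LRH 8 + BTH 12 + ICRC 4 + VCRC 2
PCIE_OVERHEAD_BYTES = 24  # assumed: framing 2 + sequence 2 + 4DW header 16 + LCRC 4


def main():
    cases = [
        ("IB x4 DDR, 2KB PMTU", 4, 5.0, 2048, IB_OVERHEAD_BYTES),
        ("PCIe x8 2.5 GT/s, 256B DMA Write", 8, 2.5, 256, PCIE_OVERHEAD_BYTES),
        ("PCIe x8 2.5 GT/s, 64B Read Completion", 8, 2.5, 64, PCIE_OVERHEAD_BYTES),
    ]
    for name, lanes, rate, payload, overhead in cases:
        raw = lanes * rate
        data = encoded_gbps(lanes, rate)
        best = effective_gbps(lanes, rate, payload, overhead)
        print(f"{name}: raw {raw:.1f} Gbps, encoded {data:.1f} Gbps, "
              f"payload at most {best:.1f} Gbps")


if __name__ == "__main__":
    main()
```

With these assumed overheads, the 64B completion case lands well below the 256B write case, which is the direction-dependent gap described above.<br><br>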

The question is whether this really matters.  If you examine most

workloads on various platforms, they simply cannot generate enough

bandwidth to consume the external I/O bandwidth capacity.  In many

cases, they are constrained by the processor or the combination of the

processor / memory components.  This isn't a bad thing when you

think about it.  For many customers, it means that the attached I/O

fabrics will be sufficiently provisioned to eliminate or largely mitigate

the impacts of external fabric events, e.g. congestion, and deliver a

reasonable solution using the existing hardware (issues of topology, use

of multi-path, etc. all come into play as a function of fabric

diameter).    In the end, customers care about whether the

application performs as expected and where the real bottlenecks

lie.  For most applications, it will come down to the processor /

memory subsystems and not the I/O or external fabric.  <br><br>

While I haven't seen all of the latest DDR micro-benchmark results, I

believe the x4 IB SDR numbers largely align with what I've outlined

here.  <br><br>

Mike<br><br>

<br><br>

<br><br>

At 02:09 AM 10/6/2006, john t wrote:<br>

<blockquote type=cite class=cite cite="">Hi Shannon,<br>

 <br>

The bandwidth figures that you quoted below match my readings for

a single-port Mellanox DDR HCA (both unidirectional and bidirectional). So

it seems a dual-port SDR HCA performs as well as a single-port DDR HCA. It

would help if you could also tell me the bandwidth that you got using one port

of your dual-port SDR HCA card. Was it half the bandwidth that you stated

below, which would mean that having two SDR ports per HCA helps? <br>

 <br>

In my case it seems that having two ports (DDR) per HCA does not increase BW,

since the PCI-e x8 limit is 16 Gb/sec per direction, and the two HCA

ports (DDR), though each capable of transferring 16 Gb/sec in each direction,

cannot go above 16 Gb/sec when used together. <br>

 <br>

Regards,<br>

John T.<br><br>

 <br>

On 10/5/06, <b>Shannon V. Davidson</b>

<<a href="mailto:svdavidson@charter.net">svdavidson@charter.net</a>

> wrote: <br>

<dl>

<dd>John,<br><br>

<dd>In our testing with dual port Mellanox SDR HCAs, we found that not

all PCI-express implementations are equal.  Depending on the PCIe

chipset, we measured unidirectional SDR dual-rail bandwidth ranging from

1100-1500 MB/sec and bidirectional SDR dual-rail bandwidth ranging from

1570-2600 MB/sec.  YMMV, but we had good luck with Intel and Nvidia

chipsets, and less success with the Broadcom Serverworks HT-1000 and

HT-2000 chipsets. My last report (in June 2006) was that Broadcom was

working to improve their PCI-express performance. <br><br>

<dd>Regards,<br>

<dd>Shannon<br><br>

<dd>john t wrote: <br>

<blockquote type=cite class=cite cite="">

<dd>Hi Bernard,<br>

<dd> <br>

<dd>I had a configuration issue. I fixed it and now I get the same BW (i.e.

around 10 Gb/sec) on each port provided I use ports on different HCA

cards. If I use two ports of the same HCA card then BW gets divided

between these two ports. I am using Mellanox HCA cards and doing simple

send/recv using uverbs. <br>

<dd> <br>

<dd>Do you think it could be an issue with the Mellanox driver, or could it be

due to a system/PCI-E limitation?<br>

<dd> <br>

<dd>Regards,<br>

<dd>John T.<br><br>

<dd> <br>

<dd>On 10/3/06, <b>Bernard King-Smith</b>

<<a href="mailto:wombat2@us.ibm.com">wombat2@us.ibm.com </a>>

wrote: <br>

<dl><br>

<dd><font size=2>John,</font> <br><br>

<dd><font size=2>Whose adapter (manufacturer) are you using? It is

usually an adapter implementation or driver issue that occurs when you

cannot scale across multiple links. The fact that you don't scale up from

one link, but it appears they share a fixed bandwidth across N links

means that there is a driver or stack issue. At one time I think that

IPoIB and maybe other IB drivers used only one event queue across

multiple links which would be a bottleneck. We added code in the IBM EHCA

driver to get round this bottleneck. </font><br><br>

<dd><font size=2>Are your measurements using MPI or IP? Are you using

separate tasks/sockets per link and using different subnets if using

IP?</font> <br>

<font size=2><br>

<dd>Bernie King-Smith  <br>

<dd>IBM Corporation<br>

<dd>Server Group<br>

<dd>Cluster System Performance  <br>

<dd><a href="mailto:wombat2@us.ibm.com">wombat2@us.ibm.com</a>

    (845)433-8483 <br>

<dd>Tie. 293-8483 or wombat2 on NOTES <br><br>

<dd>"We are not responsible for the world we are born into, only for

the world we leave when we die.<br>

<dd>So we have to accept what has gone before us and work to change the

only thing we can, <br>

<dd>-- The Future." William Shatner</font> <br><br>

<dd><tt><font face="Courier, Courier" size=2>"john t"

<<a href="mailto:johnt1johnt2@gmail.com">

johnt1johnt2@gmail.com</a>> wrote on 10/03/2006 09:42:24

AM:</font></tt> <br><br>

<dd><tt><font face="Courier, Courier" size=2>> <br>

<dd>> Hi,</font></tt> <br>

<dd><tt><font face="Courier, Courier" size=2>>  </font></tt>

<br>

<dd><tt><font face="Courier, Courier" size=2>> I have two HCA cards,

each having two ports and each connected to a <br>

<dd>> separate PCI-E x8 slot. </font></tt><br>

<dd><tt><font face="Courier, Courier" size=2>>  </font></tt>

<br>

<dd><tt><font face="Courier, Courier" size=2>> Using one HCA port I

get end to end BW of 11.6 Gb/sec (uni-direction RDMA).</font></tt> <br>

<dd><tt><font face="Courier, Courier" size=2>> If I use two ports of

the same HCA or different HCA, I get between 5 <br>

<dd>> to 6.5 Gb/sec point-to-point BW on each port. BW on each port

<br>

<dd>> further reduces if I use more ports. I am not able to understand

<br>

<dd>> this behaviour. Is there any limitation on max. BW that a system

can <br>

<dd>> provide? Does the available BW get divided among multiple HCA

ports <br>

<dd>> (which means having multiple ports will not increase the BW)?

</font></tt><br>

<dd><tt><font face="Courier, Courier" size=2>>  </font></tt>

<br>

<dd><tt><font face="Courier, Courier" size=2>>  </font></tt>

<br>

<dd><tt><font face="Courier, Courier" size=2>> Regards,</font></tt>

<br>

<dd><tt><font face="Courier, Courier" size=2>> John T<br>

</font></tt><br>

</dl><br><br>

<br>

<pre>

<dd>_______________________________________________

<dd>openib-general mailing list

<dd><a href="mailto:openib-general@openib.org">

openib-general@openib.org</a>

<dd>

<a href="http://openib.org/mailman/listinfo/openib-general" eudora="autourl">

http://openib.org/mailman/listinfo/openib-general</a>

<dd>To unsubscribe, please visit

<a href="http://openib.org/mailman/listinfo/openib-general">

http://openib.org/mailman/listinfo/openib-general</a></pre>

<font face="Courier New, Courier"></font></blockquote><br><br>

<br>

<dd><pre>-- 

<dd>____________________________________________

<dd>Shannon V. Davidson

<a href="mailto:svdavidson@charter.net">&lt;svdavidson@charter.net&gt;</a>

<dd>Senior Software

Engineer           

Raytheon

<dd>636-479-7465 office        

443-383-0331 fax

<dd>____________________________________________

</pre><font face="Courier New, Courier"></font>

</dl><br>


</blockquote></body>

</html>