Someone asked me off-line to clarify my earlier e-mail. Since this
discussion continues, perhaps this will help explain the performance a
bit more. The Max Payload Size quoted here is what is typically
implemented on x86 chipsets, though other chipsets may use a larger
value. From a pure bandwidth perspective (which is not typical of
many applications), this should be reasonably accurate. In any
case, this is just an FYI.<br><br>
An x4 IB link at 5 GT/s is 20 Gbps raw. (Customers do comprehend that the
marketing hype does not translate into that bandwidth being available to
applications; I have had to explain to the press in the past that raw
bandwidth does not equal application-available bandwidth.) Take off 8b/10b
encoding, protocol overheads, etc., and, assuming a 2KB PMTU, one can expect
to hit perhaps 14-15 Gbps per direction depending upon the
workload. Let's assume an aggregate of 30 Gbps of potential
application bandwidth for simplicity. PCIe x8 at 2.5 GT/s is also
20 Gbps raw, so again take off the 8b/10b encoding, protocol overheads,
control / application overheads, etc. It uses at most a 256B Max Payload
Size on DMA Writes and cache-line-sized (64B) DMA Read Completions, though
many people use PIO Writes to avoid DMA Reads in
micro-benchmarks. As a result, the actual performance is unlikely to reach
what IB might drive, depending upon the direction and the mix of control and
application data transactions. Add in the impact on the memory controller,
which in real-world applications is servicing the processors quite a bit more
than micro-benchmarks illustrate, and the ability of a system to drive an
IB x4 DDR device at link rate is very questionable. <br><br>
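To make the arithmetic above concrete, here is a minimal Python sketch of the link budget. The per-packet overhead values (30 bytes for an IB packet, 24 bytes for a PCIe TLP) and the helper names are assumptions for illustration, not measurements. The sketch also ignores DLLP / flow-control traffic, control transactions, and memory controller contention, so treat the results as upper bounds rather than predictions.

```python
# Rough link-budget sketch. Overhead byte counts are assumptions for
# illustration only; real values depend on the headers actually in use.

ENCODING_8B10B = 8 / 10  # 8 data bits carried per 10 bits on the wire


def data_rate_gbps(lanes, gt_per_s, encoding=ENCODING_8B10B):
    """Per-direction data rate in Gbps after line encoding."""
    return lanes * gt_per_s * encoding


def payload_efficiency(payload_bytes, overhead_bytes):
    """Fraction of the bytes on the wire that are payload."""
    return payload_bytes / (payload_bytes + overhead_bytes)


def effective_gbps(lanes, gt_per_s, payload_bytes, overhead_bytes):
    """Upper bound on payload bandwidth per direction, in Gbps."""
    return data_rate_gbps(lanes, gt_per_s) * payload_efficiency(
        payload_bytes, overhead_bytes
    )


# Assumed per-packet overheads (hypothetical round numbers).
IB_OVERHEAD_BYTES = 30    # link/transport headers plus CRCs
PCIE_OVERHEAD_BYTES = 24  # TLP framing, sequence number, header, LCRC

CASES = [
    ("IB x4 DDR, 2KB PMTU", 4, 5.0, 2048, IB_OVERHEAD_BYTES),
    ("PCIe x8 2.5 GT/s, 256B DMA Write", 8, 2.5, 256, PCIE_OVERHEAD_BYTES),
    ("PCIe x8 2.5 GT/s, 64B Read Completion", 8, 2.5, 64, PCIE_OVERHEAD_BYTES),
]

if __name__ == "__main__":
    for name, lanes, rate, payload, overhead in CASES:
        bound = effective_gbps(lanes, rate, payload, overhead)
        print(f"{name}: at most {bound:.1f} Gbps per direction")
```

Under these assumptions the 64B completion case comes out well below the 256B write case, and both come out below the IB case: small PCIe payloads spend a larger share of the link on per-packet overhead than a 2KB IB PMTU does. <br><br>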
The question is whether this really matters. If you examine most
workloads on various platforms, they simply cannot generate enough
bandwidth to consume the external I/O bandwidth capacity. In many
cases, they are constrained by the processor or the combination of the
processor / memory components. This isn't a bad thing when you
think about it. For many customers, it means that the attached I/O
fabrics will be sufficiently provisioned to eliminate or largely mitigate
the impacts of external fabric events, e.g. congestion, and deliver a
reasonable solution using the existing hardware (issues of topology, use
of multi-path, etc. all come to bear as a function of fabric
diameter). In the end, customers care about whether the
application performs as expected and where the real bottlenecks
lie. For most applications, it will come down to the processor /
memory subsystems and not the I/O or external fabric. <br><br>
While I haven't seen all of the latest DDR micro-benchmark results, I
believe the x4 IB SDR numbers largely align with what I've outlined
here. <br><br>
At 02:09 AM 10/6/2006, john t wrote:<br>
<blockquote type=cite class=cite cite="">Hi Shannon,<br>
The bandwidth figures that you quoted below match my readings for a
single-port Mellanox DDR HCA (both unidirectional and bidirectional). So
it seems a dual-port SDR HCA performs as well as a single-port DDR HCA. It
would help if you could also tell me the bandwidth that you got using one port
of your dual-port SDR HCA card. Was it half the bandwidth that you stated
below, which would mean that having two SDR ports per HCA helps? <br>
In my case it seems having two ports (DDR) per HCA does not increase BW,
since the PCI-e x8 limit is 16 Gb/sec per direction and the two HCA
ports (DDR), though each capable of transferring 16 Gb/sec in each direction,
cannot go above 16 Gb/sec when used together. <br>
John T.<br><br>
On 10/5/06, <b>Shannon V. Davidson</b>
<<a href="mailto:svdavidson@charter.net">svdavidson@charter.net</a>
> wrote: <br>
<dd>In our testing with dual port Mellanox SDR HCAs, we found that not
all PCI-express implementations are equal. Depending on the PCIe
chipset, we measured unidirectional SDR dual-rail bandwidth ranging from
1100-1500 MB/sec and bidirectional SDR dual-rail bandwidth ranging from
1570-2600 MB/sec. YMMV, but we had good luck with Intel and Nvidia
chipsets, and less success with the Broadcom Serverworks HT-1000 and
HT-2000 chipsets. My last report (in June 2006) was that Broadcom was
working to improve their PCI-express performance. <br><br>
<dd>john t wrote: <br>
<blockquote type=cite class=cite cite="">
<dd>Hi Bernard,<br>
<dd> <br>
<dd>I had a configuration issue. I fixed it and now I get the same BW (i.e.
around 10 Gb/sec) on each port, provided I use ports on different HCA
cards. If I use two ports of the same HCA card then the BW gets divided
between these two ports. I am using Mellanox HCA cards and doing simple
send/recv using uverbs. <br>
<dd> <br>
<dd>Do you think it could be an issue with the Mellanox driver, or could it be
due to a system/PCI-E limitation?<br>
<dd> <br>
<dd>John T.<br><br>
<dd> <br>
<dd>On 10/3/06, <b>Bernard King-Smith</b>
<<a href="mailto:wombat2@us.ibm.com">wombat2@us.ibm.com </a>>
wrote: <br>
<dd><font size=2>John,</font> <br><br>
<dd><font size=2>Whose adapter (manufacturer) are you using? It is
usually an adapter implementation or driver issue that occurs when you
cannot scale across multiple links. The fact that you don't scale up from
one link, but instead appear to share a fixed bandwidth across N links,
means that there is a driver or stack issue. At one time I think that
IPoIB and maybe other IB drivers used only one event queue across
multiple links, which would be a bottleneck. We added code in the IBM EHCA
driver to get around this bottleneck. </font><br><br>
<dd><font size=2>Are your measurements using MPI or IP? Are you using
separate tasks/sockets per link and using different subnets if using
IP?</font> <br>
<font size=2><br>
<dd>Bernie King-Smith <br>
<dd>IBM Corporation<br>
<dd>Server Group<br>
<dd>Cluster System Performance <br>
<dd><a href="mailto:wombat2@us.ibm.com">wombat2@us.ibm.com</a>
(845)433-8483 <br>
<dd>Tie. 293-8483 or wombat2 on NOTES <br><br>
<dd>"We are not responsible for the world we are born into, only for
the world we leave when we die.<br>
<dd>So we have to accept what has gone before us and work to change the
only thing we can, <br>
<dd>-- The Future." William Shatner</font> <br><br>
<dd><tt><font face="Courier, Courier" size=2>"john t"
<<a href="mailto:johnt1johnt2@gmail.com">
johnt1johnt2@gmail.com</a>> wrote on 10/03/2006 09:42:24
AM:</font></tt> <br><br>
<dd><tt><font face="Courier, Courier" size=2>> <br>
<dd>> Hi,</font></tt> <br>
<dd><tt><font face="Courier, Courier" size=2>> </font></tt>
<dd><tt><font face="Courier, Courier" size=2>> I have two HCA cards,
each having two ports and each connected to a <br>
<dd>> separate PCI-E x8 slot. </font></tt><br>
<dd><tt><font face="Courier, Courier" size=2>> </font></tt>
<dd><tt><font face="Courier, Courier" size=2>> Using one HCA port I
get end to end BW of 11.6 Gb/sec (uni-direction RDMA).</font></tt> <br>
<dd><tt><font face="Courier, Courier" size=2>> If I use two ports of
the same HCA or different HCA, I get between 5 <br>
<dd>> to 6.5 Gb/sec point-to-point BW on each port. BW on each port
<dd>> further reduces if I use more ports. I am not able to understand
<dd>> this behaviour. Is there any limitation on max. BW that a system
can <br>
<dd>> provide? Does the available BW get divided among multiple HCA
ports <br>
<dd>> (which means having multiple ports will not increase the BW)?
<dd><tt><font face="Courier, Courier" size=2>> </font></tt>
<dd><tt><font face="Courier, Courier" size=2>> </font></tt>
<dd><tt><font face="Courier, Courier" size=2>> Regards,</font></tt>
<dd><tt><font face="Courier, Courier" size=2>> John T<br>
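<dd>openib-general mailing list
<dd><a href="mailto:openib-general@openib.org">openib-general@openib.org</a>
<a href="http://openib.org/mailman/listinfo/openib-general" eudora="autourl">http://openib.org/mailman/listinfo/openib-general</a>
<dd>To unsubscribe, please visit
<a href="http://openib.org/mailman/listinfo/openib-general">http://openib.org/mailman/listinfo/openib-general</a>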
<font face="Courier New, Courier"></font></blockquote><br><br>
<dd>Shannon V. Davidson
<a href="mailto:svdavidson@charter.net">&lt;svdavidson@charter.net&gt;</a>
<dd>Senior Software
<dd>636-479-7465 office
443-383-0331 fax
</pre><font face="Courier New, Courier"></font>