[openib-general] Multi-port HCA

Michael Krause krause at cup.hp.com
Fri Oct 6 10:30:26 PDT 2006


Off-line, someone asked me to clarify my earlier e-mail.  Since this 
discussion is continuing, perhaps this will help explain the performance a 
bit more.  The Max Payload Size quoted here is what is typically implemented 
on x86 chipsets, though other chipsets may use a larger value.  From a pure 
bandwidth perspective (which is not typical of many applications), this 
should be reasonably accurate.  In any case, this is just an FYI.


A x4 IB link at 5 GT/s is 20 Gbps raw (customers do comprehend that the 
marketing hype does not translate into that bandwidth being available to 
applications - I have had to explain to the press in the past that raw does 
not equal application-available bandwidth).  Take off 8b/10b encoding, 
protocol overheads, etc., and assuming a 2KB PMTU, one can expect to hit 
perhaps 14-15 Gbps per direction depending upon the workload.  Let's assume 
an aggregate of 30 Gbps of potential application bandwidth for simplicity.  
PCIe x8 at 2.5 GT/s is also 20 Gbps raw, so take off the 8b/10b encoding, 
protocol overheads, control / application overheads, etc.  Given that it 
uses at most a 256B Max Payload Size on DMA Writes and cache-line-sized 
(64B) DMA Read Completions - though many people use PIO Writes to avoid DMA 
Reads when it comes to micro-benchmarks - the actual performance is unlikely 
to hit what IB might drive, depending upon the direction and mix of control 
and application data transactions.  Add in the impact on the memory 
controller, which in real-world applications is servicing the processors 
quite a bit more than micro-benchmarks illustrate, and the ability of a 
system to drive an IB x4 DDR device at link rate is very questionable.
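
For anyone who wants to check the arithmetic, here is a rough 
back-of-envelope sketch in Python.  The 8b/10b factor is exact; the 
per-packet header sizes (roughly 42B for an IB RDMA Write packet - 
LRH + BTH + RETH + ICRC + VCRC - and roughly 24B of header plus link / 
physical framing per PCIe TLP) are assumed typical values, not measurements, 
and real traffic adds ACKs, flow-control credits and control transactions 
that pull the IB figure down toward the 14-15 Gbps mentioned above.

def effective_gbps(raw_gbps, payload_bytes, overhead_bytes):
    """Usable data rate after 8b/10b encoding and per-packet overhead."""
    data_rate = raw_gbps * 0.8                # 8b/10b leaves 80% of raw
    efficiency = payload_bytes / (payload_bytes + overhead_bytes)
    return data_rate * efficiency

# IB x4 DDR: 4 lanes x 5 GT/s = 20 Gbps raw, 2KB PMTU, ~42B assumed headers
ib_ddr  = effective_gbps(20, 2048, 42)   # ~15.7 Gbps
# PCIe x8 2.5 GT/s: 20 Gbps raw, 256B Max Payload DMA Writes, ~24B assumed
pcie_wr = effective_gbps(20,  256, 24)   # ~14.6 Gbps
# Same link, DMA Read Completions arriving as cache-line-sized 64B packets
pcie_rd = effective_gbps(20,   64, 24)   # ~11.6 Gbps

print(ib_ddr, pcie_wr, pcie_rd)

Note that the 64B Read Completion case lands around 11.6 Gbps, which is 
interestingly close to the single-port figure John reports below, while the 
256B write path sits in the mid-14s - both under what the IB link itself 
could carry.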

The question is whether this really matters.  If you examine most workloads 
on various platforms, they simply cannot generate enough bandwidth to 
consume the external I/O bandwidth capacity.  In many cases, they are 
constrained by the processor or the combination of the processor / memory 
components.  This isn't a bad thing when you think about it.  For many 
customers, it means that the attached I/O fabrics will be sufficiently 
provisioned to eliminate or largely mitigate the impacts of external fabric 
events, e.g. congestion, and deliver a reasonable solution using the 
existing hardware (issues of topology, use of multi-path, etc. all come 
into play as a function of fabric diameter).  In the end, customers 
care about whether the application performs as expected and where the real 
bottlenecks lie.  For most applications, it will come down to the processor 
/ memory subsystems and not the I/O or external fabric.

While I haven't seen all of the latest DDR micro-benchmark results, I 
believe the x4 IB SDR numbers largely align with what I've outlined here.

Mike





At 02:09 AM 10/6/2006, john t wrote:
>Hi Shannon,
>
>The bandwidth figures that you quoted below match my readings for a 
>single-port Mellanox DDR HCA (both unidirectional and bidirectional). So it 
>seems a dual-port SDR HCA performs as well as a single-port DDR HCA. It 
>would help if you could also tell me the bandwidth you got using just one 
>port of your dual-port SDR HCA card. Was it half the bandwidth you stated 
>below, which would mean that having two SDR ports per HCA helps?
>
>In my case it seems that having two (DDR) ports per HCA does not increase 
>BW: the PCI-E x8 limit is 16 Gb/sec per direction, and the two (DDR) HCA 
>ports, though each capable of transferring 16 Gb/sec in each direction on 
>its own, cannot together go above 16 Gb/sec.
>
>Regards,
>John T.
>
>
>On 10/5/06, Shannon V. Davidson 
><svdavidson at charter.net> wrote:
>John,
>
>In our testing with dual port Mellanox SDR HCAs, we found that not all 
>PCI-express implementations are equal.  Depending on the PCIe chipset, we 
>measured unidirectional SDR dual-rail bandwidth ranging from 1100-1500 
>MB/sec and bidirectional SDR dual-rail bandwidth ranging from 1570-2600 
>MB/sec.  YMMV, but we had good luck with Intel and Nvidia chipsets, and less 
>success with the Broadcom Serverworks HT-1000 and HT-2000 chipsets. My 
>last report (in June 2006) was that Broadcom was working to improve their 
>PCI-express performance.
>
>Regards,
>Shannon
>
>john t wrote:
>>Hi Bernard,
>>
>>I had a configuration issue. I fixed it and now I get the same BW (i.e. 
>>around 10 Gb/sec) on each port, provided I use ports on different HCA 
>>cards. If I use two ports of the same HCA card, then the BW gets divided 
>>between these two ports. I am using Mellanox HCA cards and doing simple 
>>send/recv using uverbs.
>>
>>Do you think it could be an issue with the Mellanox driver, or could it be 
>>due to a system/PCI-E limitation?
>>
>>Regards,
>>John T.
>>
>>
>>On 10/3/06, Bernard King-Smith 
>><wombat2 at us.ibm.com> wrote:
>>
>>John,
>>
>>Whose adapter (manufacturer) are you using? It is usually an adapter 
>>implementation or driver issue that occurs when you cannot scale across 
>>multiple links. The fact that you don't scale up from one link, and instead 
>>appear to share a fixed bandwidth across N links, means that there is a 
>>driver or stack issue. At one time I think that IPoIB, and maybe other IB 
>>drivers, used only one event queue across multiple links, which would be a 
>>bottleneck. We added code in the IBM EHCA driver to get around this bottleneck.
>>
>>Are your measurements using MPI or IP? Are you using separate 
>>tasks/sockets per link, and different subnets if using IP?
>>
>>Bernie King-Smith
>>IBM Corporation
>>Server Group
>>Cluster System Performance
>>wombat2 at us.ibm.com    (845)433-8483
>>Tie. 293-8483 or wombat2 on NOTES
>>
>>"We are not responsible for the world we are born into, only for the 
>>world we leave when we die.
>>So we have to accept what has gone before us and work to change the only 
>>thing we can,
>>-- The Future." William Shatner
>>
>>"john t" <johnt1johnt2 at gmail.com> wrote on 
>>10/03/2006 09:42:24 AM:
>>
>> >
>> > Hi,
>> >
>> > I have two HCA cards, each having two ports and each connected to a
>> > separate PCI-E x8 slot.
>> >
>> > Using one HCA port I get an end-to-end BW of 11.6 Gb/sec
>> > (uni-directional RDMA).
>> > If I use two ports of the same HCA or of different HCAs, I get between 5
>> > and 6.5 Gb/sec point-to-point BW on each port. The BW on each port
>> > reduces further if I use more ports. I am not able to understand
>> > this behaviour. Is there any limitation on the max. BW that a system can
>> > provide? Does the available BW get divided among multiple HCA ports
>> > (which means having multiple ports will not increase the BW)?
>> >
>> >
>> > Regards,
>> > John T
>>
>>
>
>
>
>--
>
>____________________________________________
>
>
>Shannon V. Davidson <svdavidson at charter.net>
>
>Senior Software Engineer            Raytheon
>
>636-479-7465 office         443-383-0331 fax
>
>____________________________________________
>
>
>_______________________________________________
>openib-general mailing list
>openib-general at openib.org
>http://openib.org/mailman/listinfo/openib-general