[ofa-general] SDP performance with bzcopy testing help needed
Craig Prescott
prescott at hpc.ufl.edu
Tue Feb 26 17:44:43 PST 2008
Hi Felix;
I'm really sorry for such a slow reply. Thank you
for looking at the results!
Indeed, the performance of sockets over TOE is quite
impressive.
I did manage to update the web page last week to fix
the text regarding the memory region size. I apologize
for getting it wrong, and thank you for setting me straight.
I also added some acknowledgements.
Cheers,
Craig
Felix Marti wrote:
> Hi Craig,
>
> Thank you for pulling the data together on a website. I believe the
> results are quite interesting. It is probably worthwhile to point out a
> few performance points:
>
> Sockets over TOE:
> - gets line rate for small IO size (<1KB) with 1/2 line rate just north
> of 256B
> - cpu utilization drops to about 25% for receive and about 12.5% for
> transmit - out of a single core; various folks would prolly report this
> as 8% and 3% when considering the processing power of the entire
> machine.
> - 1B latency is about 10usecs
>
> Sockets over SDP:
> - gets line rate for IO sizes of about 16KB (ZCOPY disabled) and 64KB
> (ZCOPY enabled)
> - cpu utilization is about 100% even for large IO, and the benefit of
> ZCOPY is limited (about 12.5%)
> - 1B latency is about 20usecs
>
> You can make the same comparison for Sockets over NIC as well.
>
> I believe that these numbers show the benefit of running sockets apps
> directly over the T3 TOE interface (instead of mapping a TCP streaming
> interface to an RDMA interface and then eventually back to a TCP stream
> :), which is very efficient. A lot of folks believe that TOE provides
> little benefit, and even less benefit for small IO (which is so
> crucial for many apps), but these results really prove them wrong. Note
> that the NIC requires an IO size of 4KB to reach line rate and that
> performance falls off again as the IO size increases (beyond CPU cache
> sizes). This might be even more surprising as you use an MTU of 9KB
> (jumbo frames); the NIC vs TOE comparison would tip in the TOE's
> favor even faster if you were to run with MTU 1500.
>
> Note a small correction with respect to T3 and the DMA address range
> (for iWARP). T3 does not have any address limitation and can DMA
> to/from any 64b address. However, memory region sizes are limited
> to 4GB. OFED currently attempts to map the entire address space
> for DMA (which, IMHO, is questionable as the entire address space is
> opened up for DMA - what about UNIX security semantics? :-/). It would
> prolly be better (more secure) if apps were to map only the address
> ranges that they really want to DMA to/from; a 4GB region size
> limitation then seems adequate.
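
For instance, registering only the buffer an application actually intends
to DMA to/from would look roughly like the libibverbs sketch below. The
device choice, buffer size, and access flags are placeholders and error
handling is trimmed to the essentials; a region of this size sits
comfortably under a 4GB per-region limit.

/*
 * Minimal sketch: register one bounded buffer for DMA instead of the
 * whole address space.  Build with:  cc reg_sketch.c -libverbs
 */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Register just the 64MB the app will actually use for DMA. */
    size_t len = 64UL << 20;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        perror("ibv_reg_mr");
        return 1;
    }

    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n",
           len, mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
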
>
> Regards,
> felix
>
>
>
>> -----Original Message-----
>> From: general-bounces at lists.openfabrics.org [mailto:general-
>> bounces at lists.openfabrics.org] On Behalf Of Craig Prescott
>> Sent: Wednesday, February 13, 2008 9:32 PM
>> To: Scott Weitzenkamp (sweitzen)
>> Cc: general at lists.openfabrics.org; jim at mellanox.com
>> Subject: Re: [ofa-general] SDP performance with bzcopy testing help
>> needed
>>
>> Scott Weitzenkamp (sweitzen) wrote:
>>>> But the effect is still clear.
>>>>
>>>> throughput:
>>>>
>>>> 64K 128K 1M
>>>> SDP 7602.40 7560.57 5791.56
>>>> BZCOPY 5454.20 6378.48 7316.28
>>>>
>>> Looks unclear to me. Sometimes BZCOPY does better, sometimes worse.
>>>
>>>
>> Fair enough.
>>
>> While measuring a broader spectrum of message sizes, I noted a
>> big variation in throughput and send service demand for the SDP
>> case as a function of which core/CPU netperf ran on - in
>> particular, which CPU netperf ran on relative to which CPU was
>> handling the interrupts for ib_mthca.
>>
>> Netperf has an option (-T) that allows local and remote CPU
>> binding, so I used it to force the client and server to run on
>> CPU 0. Further, I mapped all ib_mthca interrupts to CPU 1 (irqbalance
>> was already disabled). This appears to have reduced the statistical
>> error between netperf runs to negligible amounts. I'll do more runs
>> to verify this and check out the other permutations, but this is what
>> has come out so far.
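
(For reference, pinning the benchmark process to one CPU the way "-T 0,0"
does can be sketched with sched_setaffinity - a rough illustration, not
netperf's actual code; the interrupts themselves are steered separately,
presumably by writing a CPU mask to /proc/irq/<N>/smp_affinity.)

/*
 * Sketch: pin the current process to CPU 0, roughly what netperf's
 * "-T 0,0" arranges for the local end.  Linux/glibc specific.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);          /* run only on CPU 0 */

    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* ... run the benchmark workload here ... */
    return 0;
}
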
>>
>> TPUT = throughput (Mbits/sec)
>> LCL = send service demand (usec/KB)
>> RMT = recv service demand (usec/KB)
>>
>> "-T 0,0" option given to netperf client:
>>
>> SDP BZCOPY
>> -------------------- --------------------
>> MESGSIZE TPUT LCL RMT TPUT LCL RMT
>> -------- ------- ----- ----- ------- ----- -----
>> 64K 7581.14 0.746 1.105 5547.66 1.491 1.495
>> 128K 7478.37 0.871 1.116 6429.84 1.282 1.291
>> 256K 7427.38 0.946 1.115 6917.20 1.197 1.201
>> 512K 7310.14 1.122 1.129 7229.13 1.145 1.150
>> 1M 7251.29 1.143 1.129 7457.95 0.996 1.109
>> 2M 7249.27 1.146 1.133 7340.26 0.502 1.105
>> 4M 7217.26 1.156 1.136 7322.63 0.397 1.096
>>
>> In this case, BZCOPY send service demand is significantly
>> less for the largest message sizes, though the throughput
>> for large messages is not very different.
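
(A quick back-of-the-envelope check of what those service demands mean,
using my own arithmetic on the 4M row of the "-T 0,0" table and taking
1KB = 1024 bytes:)

/*
 * Convert throughput to KB/s and multiply by the send service demand
 * to get the send-side CPU time consumed per second of transfer.
 */
#include <stdio.h>

int main(void)
{
    const char *name[2]   = { "SDP", "BZCOPY" };
    double tput_mbps[2]   = { 7217.26, 7322.63 };  /* 4M-row throughput   */
    double lcl_usec_kb[2] = { 1.156, 0.397 };      /* send service demand */

    for (int i = 0; i < 2; i++) {
        double kb_per_sec = tput_mbps[i] * 1e6 / 8.0 / 1024.0;
        double cpu_usec_per_sec = kb_per_sec * lcl_usec_kb[i];
        printf("%-6s %8.0f KB/s * %.3f usec/KB = %8.0f usec CPU/sec\n",
               name[i], kb_per_sec, lcl_usec_kb[i], cpu_usec_per_sec);
    }
    /* Output: SDP spends roughly 1.0e6 usec of send-side CPU per second
     * of transfer (about a full core), BZCOPY roughly 3.5e5 usec, i.e.
     * about a third as much at essentially the same throughput. */
    return 0;
}
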
>>
>> However, with "-T 2,2", the result looks like this:
>>
>> SDP BZCOPY
>> -------------------- --------------------
>> MESGSIZE TPUT LCL RMT TPUT LCL RMT
>> -------- ------- ----- ----- ------- ----- -----
>> 64K 7599.40 0.841 1.114 5493.56 1.510 1.585
>> 128K 7556.53 1.039 1.121 6483.12 1.274 1.325
>> 256K 7155.13 1.128 1.180 6996.30 1.180 1.220
>> 512K 5984.26 1.357 1.277 7285.86 1.130 1.166
>> 1M 5641.28 1.443 1.343 7250.43 0.811 1.141
>> 2M 5657.98 1.439 1.387 7265.85 0.492 1.127
>> 4M 5623.94 1.447 1.370 7274.43 0.385 1.112
>>
>> For BZCOPY, the results are pretty similar; but for SDP,
>> the service demands are much higher, and the throughputs
>> have dropped dramatically relative to "-T 0,0".
>>
>> In either case, though, BZCOPY is more efficient for
>> large messages.
>>
>> Cheers,
>> Craig