[ofa-general] SDP performance with bzcopy testing help needed
Craig Prescott
prescott at hpc.ufl.edu
Tue Feb 26 17:44:43 PST 2008
Hi Felix;
I'm really sorry for such a slow reply. Thank you
for looking at the results!
Indeed, the performance of sockets over TOE is quite
impressive.
I did manage to update the web page last week to fix
the text regarding the memory region size. I apologize
for getting it wrong, and thank you for setting me straight.
I also added some acknowledgements.
Cheers,
Craig
Felix Marti wrote:
> Hi Craig,
>
> Thank you for pulling the data together on a website. I believe the
> results are quite interesting. It is probably worthwhile to point out a
> few performance points:
>
> Sockets over TOE:
> - gets line rate for small IO size (<1KB) with 1/2 line rate just north
> of 256B
> - cpu utilization drops to about 25% for receive and about 12.5% for
> transmit - out of a single core; various folks would prolly report this
> as 8% and 3% when considering the processing power of the entire
> machine.
> - 1B latency is about 10usecs
>
> Sockets over SDP:
> - gets line rate for IO sizes of about 16KB (ZCOPY disabled) and 64KB
> (ZCOPY enabled)
> - cpu utilization is about 100% even for large IO, and the benefit of
> ZCOPY is limited (about 12.5%)
> - 1B latency is about 20usecs
>
> You can make the same comparison for Sockets over NIC as well.
>
> I believe that these numbers show the benefit of running sockets apps
> directly over the T3 TOE interface (instead of mapping a TCP streaming
> interface to an RDMA interface and then eventually back to a TCP stream
> :), which is very efficient. A lot of folks believe that TOE provides
> little benefit, and even less benefit for small IO (which is so
> crucial for many apps), but these results really prove them wrong. Note
> that the NIC requires an IO size of 4KB to reach line rate and that
> performance falls off again as the IO size increases (beyond CPU cache
> sizes). This might be even more surprising as you use an MTU of 9KB
> (jumbo frames); the NIC vs TOE comparison would tip in the TOE's
> favor even faster if you were to run with MTU 1500.
>
> Note a small correction with respect to T3 and the DMA address range
> (for iWARP). T3 does not have any address limitation and can DMA
> to/from any 64b address. However, memory region sizes are limited
> to 4GB. OFED currently attempts to map the entire address space
> for DMA (which, IMHO, is questionable as the entire address space is
> opened up for DMA - what about UNIX security semantics? :-/). It would
> prolly be better (more secure) if apps were to map only the address
> ranges that they really want to DMA to/from; a 4GB region size
> limitation then seems adequate.
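
For instance, registering only the buffer an application actually intends
to DMA to/from would look roughly like the libibverbs sketch below. The
device choice, buffer size, and access flags are placeholders and error
handling is trimmed to the essentials; a region of this size sits
comfortably under a 4GB per-region limit.

/*
 * Minimal sketch: register one bounded buffer for DMA instead of the
 * whole address space.  Build with:  cc reg_sketch.c -libverbs
 */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Register just the 64MB the app will actually use for DMA. */
    size_t len = 64UL << 20;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        perror("ibv_reg_mr");
        return 1;
    }

    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n",
           len, mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
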
>
> Regards,
> felix
>
>
>
>> -----Original Message-----
>> From: general-bounces at lists.openfabrics.org [mailto:general-
>> bounces at lists.openfabrics.org] On Behalf Of Craig Prescott
>> Sent: Wednesday, February 13, 2008 9:32 PM
>> To: Scott Weitzenkamp (sweitzen)
>> Cc: general at lists.openfabrics.org; jim at mellanox.com
>> Subject: Re: [ofa-general] SDP performance with bzcopy testing help
>> needed
>>
>> Scott Weitzenkamp (sweitzen) wrote:
>>>> But the effect is still clear.
>>>>
>>>> throughput:
>>>>
>>>> 64K 128K 1M
>>>> SDP 7602.40 7560.57 5791.56
>>>> BZCOPY 5454.20 6378.48 7316.28
>>>>
>>> Looks unclear to me. Sometimes BZCOPY does better, sometimes worse.
>>>
>>>
>> Fair enough.
>>
>> While measuring a broader spectrum of message sizes, I noted a
>> big variation in throughput and send service demand for the SDP
>> case as a function of which core/CPU netperf ran on - in
>> particular, which CPU netperf ran on relative to which CPU was
>> handling the interrupts for ib_mthca.
>>
>> Netperf has an option (-T) that allows local and remote CPU
>> binding, so I used it to force the client and server to run on
>> CPU 0. Further, I mapped all ib_mthca interrupts to CPU 1 (irqbalance
>> was already disabled). This appears to have reduced the statistical
>> error between netperf runs to negligible amounts. I'll do more runs
>> to verify this and check out the other permutations, but this is what
>> has come out so far.
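
(For reference, pinning the benchmark process to one CPU the way "-T 0,0"
does can be sketched with sched_setaffinity - a rough illustration, not
netperf's actual code; the interrupts themselves are steered separately,
presumably by writing a CPU mask to /proc/irq/<N>/smp_affinity.)

/*
 * Sketch: pin the current process to CPU 0, roughly what netperf's
 * "-T 0,0" arranges for the local end.  Linux/glibc specific.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);          /* run only on CPU 0 */

    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* ... run the benchmark workload here ... */
    return 0;
}
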
>>
>> TPUT = throughput (Mbits/sec)
>> LCL = send service demand (usec/KB)
>> RMT = recv service demand (usec/KB)
>>
>> "-T 0,0" option given to netperf client:
>>
>> SDP BZCOPY
>> -------------------- --------------------
>> MESGSIZE TPUT LCL RMT TPUT LCL RMT
>> -------- ------- ----- ----- ------- ----- -----
>> 64K 7581.14 0.746 1.105 5547.66 1.491 1.495
>> 128K 7478.37 0.871 1.116 6429.84 1.282 1.291
>> 256K 7427.38 0.946 1.115 6917.20 1.197 1.201
>> 512K 7310.14 1.122 1.129 7229.13 1.145 1.150
>> 1M 7251.29 1.143 1.129 7457.95 0.996 1.109
>> 2M 7249.27 1.146 1.133 7340.26 0.502 1.105
>> 4M 7217.26 1.156 1.136 7322.63 0.397 1.096
>>
>> In this case, BZCOPY send service demand is significantly
>> less for the largest message sizes, though the throughput
>> for large messages is not very different.
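
(A quick back-of-the-envelope check of what those service demands mean,
using my own arithmetic on the 4M row of the "-T 0,0" table and taking
1KB = 1024 bytes:)

/*
 * Convert throughput to KB/s and multiply by the send service demand
 * to get the send-side CPU time consumed per second of transfer.
 */
#include <stdio.h>

int main(void)
{
    const char *name[2]   = { "SDP", "BZCOPY" };
    double tput_mbps[2]   = { 7217.26, 7322.63 };  /* 4M-row throughput   */
    double lcl_usec_kb[2] = { 1.156, 0.397 };      /* send service demand */

    for (int i = 0; i < 2; i++) {
        double kb_per_sec = tput_mbps[i] * 1e6 / 8.0 / 1024.0;
        double cpu_usec_per_sec = kb_per_sec * lcl_usec_kb[i];
        printf("%-6s %8.0f KB/s * %.3f usec/KB = %8.0f usec CPU/sec\n",
               name[i], kb_per_sec, lcl_usec_kb[i], cpu_usec_per_sec);
    }
    /* Output: SDP spends roughly 1.0e6 usec of send-side CPU per second
     * of transfer (about a full core), BZCOPY roughly 3.5e5 usec, i.e.
     * about a third as much at essentially the same throughput. */
    return 0;
}
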
>>
>> However, with "-T 2,2", the result looks like this:
>>
>> SDP BZCOPY
>> -------------------- --------------------
>> MESGSIZE TPUT LCL RMT TPUT LCL RMT
>> -------- ------- ----- ----- ------- ----- -----
>> 64K 7599.40 0.841 1.114 5493.56 1.510 1.585
>> 128K 7556.53 1.039 1.121 6483.12 1.274 1.325
>> 256K 7155.13 1.128 1.180 6996.30 1.180 1.220
>> 512K 5984.26 1.357 1.277 7285.86 1.130 1.166
>> 1M 5641.28 1.443 1.343 7250.43 0.811 1.141
>> 2M 5657.98 1.439 1.387 7265.85 0.492 1.127
>> 4M 5623.94 1.447 1.370 7274.43 0.385 1.112
>>
>> For BZCOPY, the results are pretty similar; but for SDP,
>> the service demands are much higher, and the throughputs
>> have dropped dramatically relative to "-T 0,0".
>>
>> In either case, though, BZCOPY is more efficient for
>> large messages.
>>
>> Cheers,
>> Craig