[ofa-general] [PATCHES] TX batching

Sun Oct 7 11:34:53 PDT 2007

Please provide feedback on the code and/or architecture.
The last time I posted these patches I received little feedback. They
are now updated to work with the latest net-2.6.24 from a few hours ago.

Patch 1: Introduces batching interface
Patch 2: Core uses batching interface
Patch 3: get rid of dev->gso_skb

What has changed since I last posted:
1) Fixed a requeue reordering bug eyeballed by Patrick McHardy.
2) Killed ->hard_batch_xmit().
3) I am going one step back and making this set of patches even simpler
so that it is easier to review. I am therefore killing dev->hard_prep_xmit()
and focusing just on batching. I plan to re-introduce dev->hard_prep_xmit(),
but from now on I will make that a separate effort (it seems to be creating
confusion in relation to the general work).

Dave, please let me know if this meets your desire to let devices
which are SG-capable and able to compute CSUM benefit, in case I misunderstood.
Herbert, if you can look at patch 3 at least, I will appreciate it
(since it kills the dev->gso_skb that you introduced).

UPCOMING PATCHES
---------------
As before:
More patches to follow later if I get some feedback - I didn't want to
overload people by dumping too many patches. Most of the patches
mentioned below are ready to go; some need re-testing and others
need a little porting from an earlier kernel:
- tg3 driver 
- tun driver
- pktgen
- netiron driver
- e1000 driver (LLTX)
- e1000e driver (non-LLTX)
- ethtool interface
- There is at least one other driver promised to me

There's also a driver howto I wrote that was posted on netdev last week,
as well as a document that describes the architectural decisions made.

PERFORMANCE TESTING
--------------------
I started testing yesterday, but these tests take a long time,
so I will probably post results sometime at the end of the day. I
may stop running more tests and just compare batch vs non-batch results.
I have optimized the kernel config, so I expect my overall performance
numbers to look better than the last test results I posted for both
batch and non-batch.
The system under test is still a 2x dual-core Opteron with a
couple of tg3s.
A test tool generates UDP traffic of different sizes for up to 60
seconds per run or a total of 30M packets. I have 4 threads, each
running on a specific CPU, which keep all the CPUs as busy as they can
sending packets targeted at a directly connected box's UDP discard port.
All 4 CPUs target a single tg3 to send. The receiving box has a tc rule
which counts and drops all incoming UDP packets to the discard port - this
allows me to make sure that the receiver is not the bottleneck in the
testing. Packet sizes sent are {8B, 32B, 64B, 128B, 256B, 512B, 1024B}.
Each packet size run is repeated 10 times to ensure that there are no
transients. The average of all 10 runs is then computed and collected.
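
For anyone wanting to reproduce the setup, the test procedure above can
be sketched roughly as follows. This is not the actual test tool (which
is not shown here), only an illustration of the structure: per-CPU
pinned senders, a run that stops at a time or packet limit, and an
average over repeated runs. All function names are hypothetical, and
several details are assumptions: that the listed sizes are UDP payload
sizes, that the 30M packet budget is split evenly across the 4 senders,
and that the discard service is on the standard port 9. A Python sender
will not saturate a NIC the way a real generator does.

```python
import os
import socket
import statistics
import threading
import time

DISCARD_PORT = 9                                  # standard UDP discard port (assumed)
PACKET_SIZES = [8, 32, 64, 128, 256, 512, 1024]   # bytes, treated as payload sizes
MAX_SECONDS = 60                                  # time limit per run
MAX_PACKETS = 30_000_000                          # packet limit per run
NUM_SENDERS = 4                                   # one sender thread per CPU
REPEATS = 10                                      # runs per packet size


def send_run(sock, dest, size, max_packets, max_seconds, cpu=None):
    """Send size-byte datagrams until the packet or time limit is reached."""
    if cpu is not None and hasattr(os, "sched_setaffinity"):
        # On Linux, pid 0 pins the calling thread to the given CPU set.
        os.sched_setaffinity(0, {cpu})
    payload = bytes(size)
    deadline = time.monotonic() + max_seconds
    sent = 0
    while sent < max_packets and time.monotonic() < deadline:
        sock.sendto(payload, dest)
        sent += 1
    return sent


def run_once(host, size):
    """One run: all senders target the same host; returns packets/second."""
    counts = [0] * NUM_SENDERS

    def worker(i):
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            counts[i] = send_run(sock, (host, DISCARD_PORT), size,
                                 MAX_PACKETS // NUM_SENDERS, MAX_SECONDS,
                                 cpu=i)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(NUM_SENDERS)]
    start = time.monotonic()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(counts) / (time.monotonic() - start)


def average_rate(host, size, repeats=REPEATS, run=run_once):
    """Repeat a run and return the mean rate, to smooth out transients."""
    return statistics.fmean([run(host, size) for _ in range(repeats)])
```

A full sweep would then call average_rate(receiver, size) for each size
in PACKET_SIZES, once on a batching kernel and once on a non-batching
one, and compare the two sets of averages.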

I also plan to run forwarding and TCP tests in the future when the
dust settles.

cheers,
jamal