[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
David Miller
davem at davemloft.net
Tue Oct 9 17:50:25 PDT 2007
From: Andi Kleen <andi at firstfloor.org>
Date: Wed, 10 Oct 2007 02:37:16 +0200
> On Tue, Oct 09, 2007 at 05:04:35PM -0700, David Miller wrote:
> > We have to keep in mind, however, that the sw queue right now is 1000
> > packets. I heavily discourage any driver author from trying to use any
> > single TX queue of that size.
>
> Why would you discourage them?
>
> If 1000 is ok for a software queue, why would it not be ok
> for a hardware queue?
Because with the software queue, you aren't accessing 1000 slots
shared with the hardware device, which performs shared-ownership
transactions on those L2 cache lines with the CPU.
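
To make that concrete, here is a rough sketch of a hypothetical
16-byte TX descriptor (the layout and field names are made up, not
any real driver's):

#include <stdint.h>

/* The CPU writes addr/len/flags when it enqueues a packet; the NIC
 * DMA engine later writes the status word on completion.  With
 * 64-byte cache lines, four of these descriptors share one line, so
 * every completion write by the device steals the line from the CPU
 * and the next enqueue has to pull it back (shared ownership).  A
 * software qdisc queue, by contrast, is touched only by the CPU. */
struct tx_desc {
	uint64_t addr;    /* DMA address of the packet buffer */
	uint16_t len;     /* packet length */
	uint16_t flags;   /* command bits set by the CPU */
	uint32_t status;  /* completion bits written by the NIC */
};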
Long ago I ran a gigabit test on a CPU with only 256K of
L2 cache. Using a smaller TX queue made things go faster,
and it's exactly because of these L2 cache effects.
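
The back-of-envelope math (assuming the 16-byte descriptor above and
64-byte cache lines, both made-up but typical numbers) shows why the
smaller ring won on that 256K machine:

#include <stdio.h>

int main(void)
{
	const unsigned desc_size = 16, line_size = 64;
	const unsigned rings[] = { 256, 1000 };

	for (unsigned i = 0; i < 2; i++) {
		unsigned bytes = rings[i] * desc_size;
		printf("%4u descs: %5u B of descriptors, %3u cache lines "
		       "bounced to the device\n",
		       rings[i], bytes, bytes / line_size);
	}
	/* 1000 descs -> ~250 lines (16000 B) dirtied by the NIC, plus
	 * the headers of up to 1000 in-flight skbs: a big slice of a
	 * 256 KB L2.  256 descs -> 64 lines (4 KB), which fits easily. */
	return 0;
}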
> 1000 packets is a lot. I don't have hard data, but gut feeling
> is less would also do.
I'll try to see how backlogged my 10Gb tests get when a strong
sender is sending to a weak receiver.
> And if the hw queues are not enough, a better scheme might be to
> just manage this in the sockets, in sendmsg: e.g. provide a wait queue
> that drivers can wake up, and let senders block until there is more
> queue space.
TCP does this already, but it operates in a lossy manner.
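
For reference, the scheme Andi describes would look roughly like this
(a sketch with made-up names, using the kernel's standard wait-queue
primitives; locking is simplified). TCP gets a similar effect today
through sk_stream_wait_memory() and the socket write-space callback,
but mediated by socket buffer accounting rather than ring slots:

#include <linux/wait.h>
#include <linux/spinlock.h>

/* Hypothetical per-device state: sendmsg() sleeps until the TX
 * reclaim path frees descriptors and wakes it up. */
struct batch_dev {
	wait_queue_head_t tx_wait;
	spinlock_t        tx_lock;
	unsigned int      tx_free;  /* free TX descriptors */
};

/* sendmsg() side: block (interruptibly) until there is ring space. */
static int batch_dev_wait_for_space(struct batch_dev *bd)
{
	return wait_event_interruptible(bd->tx_wait, bd->tx_free > 0);
}

/* TX completion side: return freed slots and wake any writers. */
static void batch_dev_tx_done(struct batch_dev *bd, unsigned int freed)
{
	spin_lock(&bd->tx_lock);
	bd->tx_free += freed;
	spin_unlock(&bd->tx_lock);
	wake_up_interruptible(&bd->tx_wait);
}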
> I don't really see the advantage over the qdisc in that scheme.
> It's certainly not simpler and probably more code and would likely
> also not require less locks (e.g. a currently lockless driver
> would need a new lock for its sw queue). Also it is unclear to me
> it would be really any faster.
You still need a lock to guard hw TX enqueue from hw TX reclaim.
A 256-entry TX hw queue fills up trivially at 1Gb and 10Gb, but if you
increase the size much beyond that, performance starts to go down due
to L2 cache thrashing.
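
I.e. something like the following pattern (a simplified sketch of a
hypothetical driver; real drivers also disable interrupts around the
lock or split enqueue and reclaim locks):

#include <linux/netdevice.h>
#include <linux/spinlock.h>

/* Even a driver with no software queue serializes the xmit path
 * against TX completion reclaim, because both move the same ring
 * indices.  prod and cons are free-running unsigned counters. */
struct my_tx_ring {
	spinlock_t   lock;   /* guards prod/cons and the ring */
	unsigned int prod;   /* next slot the CPU fills */
	unsigned int cons;   /* next slot reclaim frees */
	unsigned int size;   /* number of descriptors */
};

static bool my_tx_ring_full(struct my_tx_ring *r)
{
	return r->prod - r->cons >= r->size;
}

/* hard_start_xmit path: enqueue under the lock. */
static int my_xmit(struct my_tx_ring *r)
{
	spin_lock(&r->lock);
	if (my_tx_ring_full(r)) {
		spin_unlock(&r->lock);
		return NETDEV_TX_BUSY;
	}
	/* ... fill the descriptor at prod ... */
	r->prod++;
	spin_unlock(&r->lock);
	return NETDEV_TX_OK;
}

/* TX completion path: reclaim under the same lock. */
static void my_tx_reclaim(struct my_tx_ring *r, unsigned int done)
{
	spin_lock(&r->lock);
	r->cons += done;
	/* ... unmap and free the completed skbs ... */
	spin_unlock(&r->lock);
}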