[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
David Miller
davem at davemloft.net
Wed Oct 10 02:25:50 PDT 2007
From: Andi Kleen <andi at firstfloor.org>
Date: Wed, 10 Oct 2007 11:16:44 +0200
> > A 256 entry TX hw queue fills up trivially on 1GB and 10GB, but if you
>
> With TSO really?
Yes.
> > increase the size much more performance starts to go down due to L2
> > cache thrashing.
>
> Another possibility would be to consider using cache avoidance
> instructions while updating the TX ring (e.g. write combining
> on x86)
The chip I was working with at the time (UltraSPARC-IIi) compressed
all the linear stores into 64-byte full cacheline transactions via
the store buffer.
It's true that it would allocate in the L2 cache on a miss, which
is different from your suggestion.
In fact, such a thing might not pan out well, because most of the time
you write a single descriptor or two, and that isn't a full cacheline,
which means a read/modify/write is the only coherent way to make such
a write to RAM.
Sure you could batch, but I'd rather give the chip work to do unless
I unequivocally knew I'd have enough pending to fill a cacheline's
worth of descriptors. And since you suggest we shouldn't queue in
software... :-)