[ofa-general] RE: [PATCH 2/3][NET_BATCH] net core use batching

Waskiewicz Jr, Peter P peter.p.waskiewicz.jr at intel.com
Mon Oct 8 15:33:42 PDT 2007


> true, that needs some resolution. Here's a hand-waving thought:
> Assuming all packets of a specific queue mapping end up in the same 
> qdisc queue, it seems feasible to ask the qdisc scheduler to 
> give us enough packages (I've seen people use that term to 
> refer to packets) for each hardware ring's available space. 
> With the patches I posted, I do that via
> dev->xmit_win, which assumes only one view of the driver; essentially a
> single ring.  
> If that is doable, then it is up to the driver to say "I have 
> space for 5 in ring[0], 10 in ring[1], 0 in ring[2]" based on 
> what scheduling scheme the driver implements - the dev->blist 
> can stay the same. It's a handwave, so there may be issues 
> there and there could be better ways to handle this.
> 
> Note: The other issue that needs resolving, which I raised 
> earlier, was regarding multiqueue running on multiple CPUs 
> servicing different rings concurrently. 

I can see the qdisc being modified to send batches per queue_mapping.
This shouldn't be too difficult, especially if we had xmit_win per queue
(in the subqueue struct, as Dave pointed out).
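Roughly, the per-queue piece I have in mind is just this (a sketch only;
xmit_win is the proposed addition, and the layout is illustrative rather
than the actual subqueue struct):

/* Illustrative sketch: each Tx queue carries a driver-maintained window
 * so the qdisc can ask how many packets ring N will accept right now.
 */
struct tx_band {
        unsigned long state;    /* per-queue stopped/running bits */
        int xmit_win;           /* proposed: free Tx descriptors the
                                 * driver advertises for this ring */
};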

Addressing your note about different rings being serviced concurrently:
I'd like to remove the QDISC_RUNNING bit from the global
device; with Tx multiqueue, this bit should be set on each queue (if at
all), allowing multiple Tx rings to be loaded simultaneously.  The
biggest issue today with the multiqueue implementation is the global
queue_lock.  I see it being a hot source of contention in my testing; my
setup is an 8-core machine (dual quad-core procs) with a 10GbE NIC, using
8 Tx and 8 Rx queues.  On transmit, when loading all 8 queues, the
enqueue/dequeue are hitting that lock quite a bit for the whole device.
I really think that the queue_lock should join the queue_state, so the
device no longer manages the top-level state (since we're operating
per-queue instead of per-device).
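To make that concrete, here's roughly the shape I'm picturing (all names
are illustrative, not existing kernel interfaces; this extends the
tx_band sketch above):

#include <linux/spinlock.h>
#include <linux/skbuff.h>
#include <linux/bitops.h>

/* Each band carries its own lock, state and backlog, so nothing
 * device-global is taken on the Tx hot path.
 */
struct tx_band {
        spinlock_t lock;                /* replaces the global queue_lock */
        unsigned long state;            /* replaces global QDISC_RUNNING */
        int xmit_win;                   /* proposed per-ring window */
        struct sk_buff_head backlog;    /* this band's queued skbs */
};

#define TX_BAND_RUNNING 0

/* Only one CPU runs a given band at a time, but different CPUs can run
 * different bands concurrently.
 */
static int band_try_start(struct tx_band *b)
{
        return !test_and_set_bit(TX_BAND_RUNNING, &b->state);
}

static void band_finish(struct tx_band *b)
{
        clear_bit(TX_BAND_RUNNING, &b->state);
}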

> It's the core that does that, not the driver; the driver 
> continues to use ->hard_start_xmit() (albeit a modified one). 
> The idea is not to have many new interfaces.

I'll look closer at this, since I think I confused myself.

> Isnt multiqueue mainline already?

Well, it's in 2.6.23-rc*.  I imagine it won't see much action though
until 2.6.24, since people will be porting drivers during that time.
Plus, with the native Rx multiqueue w/NAPI code landing in 2.6.24, it
makes sense to have Tx multiqueue arrive at the same time.

> The point behind batching is to reduce the cost of the locks 
> by amortizing it across a batch of packets. Even better, if one 
> can, get rid of the locks entirely. Remind me, why do you need the 
> per-queuemap lock? And is it needed from the enqueuing side 
> too? Maybe let's start there to help me understand things?

The multiqueue implementation today enforces the number of qdisc bands
(RR or PRIO) to be equal to the number of Tx rings your hardware/driver
supports.  Therefore, the queue_lock and queue_state in the kernel
directly relate to the qdisc band management.  If the driver stops a
queue, the qdisc won't try to dequeue from that band.  What I'm working
on is to move the lock there too, so I can lock the queue when I enqueue
(protecting the band from multiple sources modifying the skb chain) and
lock it when I dequeue.  This is purely for concurrency of adding/popping
skb's from the qdisc queues.  Right now, we take the whole global lock to
add and remove skb's.  This is the next logical step in removing the
queues' dependency on each other.  Please let me know if this doesn't
make sense, or if you have any questions at all about my reasoning.  I
agree that this is where we should be on the same page before moving on
to anything else in this discussion.  :)
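For concreteness, the per-band locking would look roughly like this
(continuing the tx_band sketch above; these are not the real qdisc
->enqueue/->dequeue entry points, just the locking shape):

static void band_enqueue(struct tx_band *b, struct sk_buff *skb)
{
        spin_lock(&b->lock);            /* protect only this band's chain */
        __skb_queue_tail(&b->backlog, skb);
        spin_unlock(&b->lock);
}

static struct sk_buff *band_dequeue(struct tx_band *b)
{
        struct sk_buff *skb;

        spin_lock(&b->lock);            /* other CPUs can be in other bands */
        skb = __skb_dequeue(&b->backlog);
        spin_unlock(&b->lock);
        return skb;
}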

> Sure, that is doable if the driver can set a per 
> queue_mapping xmit_win and the qdisc can be taught to say 
> "give me packets for queue_mapping X"

Yes, I like this idea very much.  Do that, modify the qdisc to send in
chunks from a queue, and the problem should be solved.
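Hand-waving a bit, the chunked dequeue could look something like this
(hypothetical function, continuing the sketches above; blist mirrors the
dev->blist idea from your patches):

/* Pull at most xmit_win skbs from one band into a batch list that the
 * core then hands to the driver's ->hard_start_xmit().
 */
static int band_dequeue_batch(struct tx_band *b, struct sk_buff_head *blist)
{
        int n = 0;

        spin_lock(&b->lock);
        while (n < b->xmit_win) {
                struct sk_buff *skb = __skb_dequeue(&b->backlog);

                if (!skb)
                        break;
                __skb_queue_tail(blist, skb);
                n++;
        }
        spin_unlock(&b->lock);
        return n;                       /* how many skbs were batched */
}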

I will try to find some additional cycles to get my patches completely
working and send them.  I think it'd be easier to see what's going on if
I did that.  I'll also try to make them work with the ideas of xmit_win
per queue and batched per-queue qdisc sends.  Stay tuned...

Thanks Jamal,

-PJ Waskiewicz
<peter.p.waskiewicz.jr at intel.com>


