<html>

<body>

<font size=3>At 06:53 PM 10/8/2007, Jeff Garzik wrote:<br>

<blockquote type=cite class=cite cite="">David Miller wrote:<br>

<blockquote type=cite class=cite cite="">From: Jeff Garzik

&lt;jeff@garzik.org&gt;<br>

Date: Mon, 08 Oct 2007 10:22:28 -0400<br><br>

<blockquote type=cite class=cite cite="">In terms of overall

parallelization, both for TX as well as RX, my gut feeling is that we

want to move towards an MSI-X, multi-core friendly model where packets

are LIKELY to be sent and received by the same set of [cpus | cores |

packages | nodes] as the [userland] processes dealing with the

data.</blockquote>The problem is that the packet schedulers want global

guarantees<br>

on packet ordering, not flow centric ones.<br>

That is the issue Jamal is concerned about.</blockquote><br>

Oh, absolutely.<br><br>

I think, fundamentally, any amount of cross-flow resource management done

in software is an obstacle to concurrency.<br><br>

That's not a value judgement, just a statement of

fact.</font></blockquote><br>

Correct.<br><br>

<br>

<blockquote type=cite class=cite cite=""><font size=3>"traffic

cops" are intentional bottlenecks we add to the process, to enable

features like priority flows, filtering, or even simple socket fairness

guarantees.  Each of those bottlenecks serves a valid purpose, but

at the end of the day, it's still a bottleneck.<br><br>

So, improving concurrency may require turning off useful features that

nonetheless hurt concurrency.</font></blockquote><br>

Software needs to get out of the main data path - another fact of

life.<br><br>

<br><br>

<blockquote type=cite class=cite cite="">

<blockquote type=cite class=cite cite=""><font size=3>The more I think

about it, the more inevitable it seems that we really<br>

might need multiple qdiscs, one for each TX queue, to pull this full<br>

parallelization off.<br>

But the semantics of that don't smell so nice either.  If the

user<br>

attaches a new qdisc to "ethN", does it go to all the TX

queues, or<br>

what?<br>

All of the traffic shaping technology deals with the device as a

unary<br>

object.  It doesn't fit to multi-queue at all.</blockquote><br>

Well the easy solutions to networking concurrency are<br><br>

* use virtualization to carve up the machine into chunks<br><br>

* use multiple net devices<br><br>

Since new NIC hardware is actively trying to be friendly to

multi-channel/virt scenarios, either of these is reasonably

straightforward given the current state of the Linux net stack. 

Using multiple net devices is especially attractive because it works very

well with the existing packet scheduling.<br><br>

Both unfortunately impose a burden on the developer and admin, to force

their apps to distribute flows across multiple [VMs | net

devs].</font></blockquote><br>

Not an optimal approach.<br><br>

<blockquote type=cite class=cite cite=""><font size=3>The third

alternative is to use a single net device, with SMP-friendly packet

scheduling.  Here you run into the problems you described

"device as a unary object" etc. with the current

infrastructure.<br><br>

With multiple TX rings, consider that we are pushing the packet

scheduling from software to hardware...  which implies<br>

* hardware-specific packet scheduling<br>

* some TC/shaping features not available, because hardware doesn't

support it</blockquote><br>

For a number of years now, we have designed interconnects to support a

reasonable range of arbitration capabilities among hardware resource

sets.  With reasonable classification by software to identify a

hardware resource set (ideally an interpretation of the application's view

of its priority, combined with policy management software that determines

how that should map among competing application views), one can eliminate

most of the CPU cycles spent in today's implementations.   I

and others presented a number of these concepts many years ago during the

development which eventually led to IB and iWARP.<br><br>

- Each resource set can be assigned to a unique PCIe function or a

function group to enable function / group arbitration to the PCIe

link.<br><br>

- Each resource set can be assigned to a unique PCIe TC and, with improved

ordering hints (coming soon), can be used to eliminate false ordering

dependencies.<br><br>

- Each resource set can be assigned to a unique IB TC / SL or iWARP

802.1p to signal priority.  These can then be used to program

respective link arbitration as well as path selection to enable

multi-path load balancing. <br><br>

- Many IHVs have picked up on the arbitration capabilities and extended

them, as shown years ago by a number of us, to enable resource set

arbitration and a variety of QoS based policies.  If software

defines a reasonable (i.e. small) number of management and control knobs,

then these can be easily mapped to most h/w implementations.  

Some of us are working on how to do this for virtualized environments and

I expect these to be applicable to all environments in the end.<br><br>
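As a rough illustration of the "small number of knobs" idea above, here is a minimal sketch in Python. Everything in it is hypothetical: the names `ResourceSet`, `POLICY`, `classify` and `arbitration_shares`, and all of the numeric values, are assumptions made for the example and do not come from any real driver, adapter or specification. It only shows software doing classification and policy once, and leaving per-packet arbitration to the hardware.<br><br>

```python
from dataclasses import dataclass


# Hypothetical description of one hardware resource set. The fields mirror
# the knobs listed above; none of this is a real driver or adapter API.
@dataclass(frozen=True)
class ResourceSet:
    name: str
    pcie_function: int  # PCIe function (or function group) used for arbitration
    pcie_tc: int        # PCIe traffic class, 0-7
    link_priority: int  # 802.1p priority (0-7); an SL would play this role on IB
    weight: int         # relative share the hardware arbiter gives this set


# Policy table, assumed to be written by policy management software:
# the application's view of its priority maps to one resource set.
POLICY = {
    "latency": ResourceSet("latency", pcie_function=0, pcie_tc=7,
                           link_priority=6, weight=4),
    "bulk": ResourceSet("bulk", pcie_function=1, pcie_tc=0,
                        link_priority=1, weight=2),
    "default": ResourceSet("default", pcie_function=2, pcie_tc=0,
                           link_priority=0, weight=1),
}


def classify(app_class):
    """Map an application's stated class to a resource set.

    Unknown classes fall back to the "default" set.
    """
    return POLICY.get(app_class, POLICY["default"])


def arbitration_shares(policy):
    """Fraction of the link each set would get under full contention,
    assuming a simple weighted arbiter (an assumption of this sketch)."""
    total = sum(rs.weight for rs in policy.values())
    return {name: rs.weight / total for name, rs in policy.items()}
```

Here `classify("latency")` returns the set bound to PCIe TC 7, and `arbitration_shares(POLICY)` gives the share each set would receive when the link is contended. Software is involved only when the policy is set up or changed, not on the per-packet path.<br><br>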

One other key item to keep in mind is that, unless there is contention in

the system, the majority of the QoS mechanisms are meaningless, and in a

very large percentage of customer environments they simply don't scale

with device and interconnect performance.   Many applications

in fact remain processor / memory constrained and therefore do not stress

the I/O subsystem or the external interconnects, which makes most of the

software mechanisms rather moot in real customer

environments.   The simple truth is that it is nearly always cheaper to

over-provision the I/O / interconnects than to use the software approach,

which, while quite applicable in many environments at 1 Gbps and

below, generally has less meaning / value in environments moving from 10

to 40 to 100 Gbps.   It does not really matter

whether one believes in protocol off-load or protocol on-load: the

interconnects will be able to handle all commercial workloads and perhaps

all but the most extreme HPC (even there, one might contend that any

software intermediary would be discarded in favor of removing OS / kernel

overhead from the main data path).  This isn't to say that software

has no role to play, only that its role needs to shift from main data path

overhead to one of policy shaping and programming of h/w based

arbitration.   This will hold true for both virtualized and

non-virtualized environments.<br><br>

Mike</font></body>

</html>