[openib-general] Re: openib-general Digest, Vol 22, Issue 114

Grant Grundler iod00d at hp.com
Wed Apr 19 09:42:26 PDT 2006


On Wed, Apr 19, 2006 at 10:10:36AM -0400, Bernard King-Smith wrote:
> The benefit you are
> working on is latency will be faster if we handle both send and receive
> processing off the same thread/interrupt, but you have to balance that with
> bandwidth limitations. You think 4X has a bandwidth problem using IPoIB,
> wait till 12X comes out.

[ I've probably posted some of these results before...here's another
take on this problem. ]

I've looked at this tradeoff pretty closely with ia64 (1.5 GHz)
by pinning netperf to a different CPU than the one handling interrupts.
By moving netperf RX traffic off the CPU handling interrupts,
the 1.5 GHz ia64 box goes from 2.8 Gb/s to around 3.5 Gb/s.
But the "service demand" (CPU time per KB of payload) goes up
from ~2.3 usec/KB to ~3.1 usec/KB - cacheline misses go up dramatically.
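
(For anyone who wants to reproduce the pinning setup: a minimal sketch in
Python, not the exact commands I ran. The CPU number and the "mthca" match
string are assumptions - check /proc/interrupts on your own box.)

    import os

    APP_CPU = 1   # CPU we want netperf on -- pick one NOT taking HCA interrupts

    # Pin the current process; anything exec'd from here (netperf, netserver)
    # inherits this affinity mask.
    os.sched_setaffinity(0, {APP_CPU})

    # Print the header plus any mthca lines from /proc/interrupts so we can
    # confirm the HCA's completion interrupts land on a *different* CPU.
    with open("/proc/interrupts") as f:
        for line in f:
            if "CPU0" in line or "mthca" in line:
                print(line.rstrip())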

I expect splitting the RX/TX completions would achieve something
similar, since we are just "slicing" the same problem from a different
angle. Apps typically do both RX and TX and will be running on one
CPU, so on one of the two paths they will be missing cachelines.

Anyway, my take is IPoIB perf isn't as critical as SDP and RDMA perf.
If folks really care about perf, they have to migrate away from
IPoIB to either SDP or direct use of RDMA (uDAPL or something).
Splitting RX/TX completions might help initial adoption, but
that isn't where the big wins in perf are.

Pinning netperf/netserver to a different CPU caused SDP perf
to drop from 5.5 Gb/s to 5.4 Gb/s. Service demand went from
around 0.55 usec/KB to 0.56 usec/KB, i.e. a much smaller impact
on cacheline misses.
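
(Back-of-the-envelope, to put those service demand numbers side by side;
this assumes the usual netperf convention of 1 KB = 1024 bytes and uses
only the throughput/service-demand pairs quoted above.)

    # Convert (throughput, service demand) into CPU-seconds burned per
    # second of wall clock, using the numbers quoted in this mail.
    def cpu_sec_per_sec(gbit_per_sec, usec_per_kb):
        kb_per_sec = gbit_per_sec * 1e9 / 8 / 1024   # assumes KB == 1024 bytes
        return kb_per_sec * usec_per_kb * 1e-6

    for label, gbps, sd in [("IPoIB, same CPU", 2.8, 2.30),
                            ("IPoIB, pinned  ", 3.5, 3.10),
                            ("SDP,   same CPU", 5.5, 0.55),
                            ("SDP,   pinned  ", 5.4, 0.56)]:
        print(f"{label}: ~{cpu_sec_per_sec(gbps, sd):.2f} CPU-sec per sec")

That works out to roughly 0.8-1.3 CPU-seconds per second for IPoIB versus
~0.37 for SDP, which is the code path footprint showing up directly as
CPU time.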

Keeping traffic local to the CPU that's taking the interrupt
keeps the cachelines local. I don't want to discourage anyone
from their pet projects. But the conclusion I drew from the
above data is that IPoIB is a good compatibility story, but cacheline
misses are going to make it hard to improve perf regardless
of how we divide the workload. The IPoIB + TCP/IP code path just has
a big footprint.
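
(If someone wants to try the "keep it local" case explicitly, the other
half is steering the HCA's completion interrupt onto the CPU the app is
pinned to. A rough sketch - needs root, and IRQ 59 is a placeholder; look
up the real number in /proc/interrupts.)

    # Steer a (hypothetical) HCA completion IRQ onto the app's CPU.
    HCA_IRQ = 59        # placeholder; take the real IRQ from /proc/interrupts
    APP_CPU = 1         # same CPU the app was pinned to above

    mask = 1 << APP_CPU # smp_affinity takes a hex CPU bitmask
    with open(f"/proc/irq/{HCA_IRQ}/smp_affinity", "w") as f:
        f.write(f"{mask:x}\n")
    print(f"IRQ {HCA_IRQ} restricted to CPU {APP_CPU} (mask {mask:x})")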

> What per CPU utilization do you see on mthca on a multiple CPU machine
> running peak bandwidth?

I'm interested in those results as well.

thanks,
grant


