[openib-general] Re: Speeding up IPoIB.

Grant Grundler iod00d at hp.com
Wed Apr 19 15:56:30 PDT 2006


On Wed, Apr 19, 2006 at 03:10:29PM -0400, Bernard King-Smith wrote:
> Grant> I expect splitting the RX/TX completions would achieve something
> Grant> similar since we are just "slicing" the same problem from a different
> Grant> angle.  Apps typically do both RX and TX and will be running on one
> Grant> CPU. So on one path they will be missing cachelines.
> 
> However, the event handler(s) handling the RX/TX completion are not
> guaranteed to run on the same CPU as the application unless you have the
> scheduler do some kind of affinity between the application and the event
> handler for the completion queue. In addition, if an application has
> multiple sockets then the event handlers are all over the place because each
> socket has its own completion queue. Does one event handler handle all
> completion queues?

This depends on the HCA. mthca only uses one AFAIK.
I believe Roland just confirmed that in a previous email to Shirley Ma.

> Grant> Anyway, my take is IPoIB perf isn't as critical as SDP and RDMA
> perf.
> Grant> If folks really care about perf, they have to migrate away from
> Grant> IPoIB to either SDP or directly use RDMA (uDAPL or something).
> Grant> Splitting RX/TX completions might help initial adoption, but
> Grant> that isn't where the big wins in perf are.
> 
> My take is, good enough is not good enough. If the cost to move from IP to
> SDP or RDMA is too great, then applications (particularly in the
> commercial sector) will not convert. Hence if IPoIB is too slow they will
> go with Ethernet.

I agree with that assessment. I'm just pointing out that IPoIB
has a major tuning problem with the TCP/IP stack.

If the single event handler is what's holding IPoIB back, then this is a
deficiency in mthca that newer HCAs can address in their MSI/MSI-X support.



> Currently we only get 40% of the link bandwidth compared to
> 85% for 10 GigE. (Yes, I know the cost differences, which favor IB.)

10 GigE is getting 85% without TOE?
Or are they distributing event handling across several CPUs?
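One way to check (and redistribute) where interrupt handling lands is via
/proc/interrupts and the per-IRQ smp_affinity files. A rough sketch — the IRQ
numbers below are made up, the writes need root, and you'd look up the real
lines for your HCA in /proc/interrupts first:

```shell
# See which IRQ line(s) the HCA uses and which CPUs have been servicing them.
grep -i mthca /proc/interrupts

# Steer IRQ 42 to CPU1 and IRQ 43 to CPU2. The value is a CPU bitmask,
# so 2 = CPU1, 4 = CPU2. (IRQ numbers here are illustrative.)
echo 2 > /proc/irq/42/smp_affinity
echo 4 > /proc/irq/43/smp_affinity
```

With MSI/MSI-X an HCA can expose multiple vectors (e.g. RX vs TX completions),
and each vector can be steered to a different CPU this way.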

> However, two things hurt user level protocols. First is scaling and memory
> requirements. Looking at parallel file systems on large clusters, SDP ended
> up consuming so much memory it couldn't be used: with N-by-N socket
> connections per node, the buffer space and QP memory SDP required got
> out of control. There is something to be said for sharing buffer and QP
> space across lots of sockets.

My guess is it's easier to fix SDP than to reduce the TCP/IP
cache/CPU footprint. I realize only a subset of apps can (or will
try to) use SDP because of setup/config issues.  I still believe SDP
is useful to a majority of apps without having to recompile them.

> The other issue is flow control across hundreds of autonomous sockets. In
> TCP/IP, traffic can be managed so that there is some fairness
> (multiplexing, QoS etc.) across all active sockets.  For user level
> protocols like SDP and uDAPL, you can't manage traffic across multiple
> autonomous user application connections because there is nowhere to see all
> of them at the same time for management. This can lead to overrunning
> adapters or timeouts to the applications. This tends to be a large system
> problem when you have lots of CPUs.

I'm not competent to disagree in detail.
Fabian Tillier and Caitlin Bestler can (and have) addressed this.

> SDP and uDAPL have some good ideas but have a way to go for anything except
> HPC and workloads that are not expected to scale to large configurations.
> For HPC you can use MPI for application message passing, but for the rest
> of the cluster traffic you need a good performing IP implementation for
> now. With time things can improve. There is also IPoIB-CM for much lower
> IPoIB overhead.

I had the impression that IB provides Reliable Datagram
semantics roughly equivalent to what TCP provides.
I'm sure it's not exactly the same, but in general
that disagrees with your assertion above.

> 
> Grant> Pinning netperf/netserver to a different CPU caused SDP perf
> Grant> to drop from 5.5 Gb/s to 5.4 Gb/s. Service Demand went from
> Grant> around 0.55 usec/KB to 0.56 usec/KB. ie a much smaller impact
> Grant> on cacheline misses.
> 
> I agree cacheline misses are something that has to be watched carefully.
> For some platforms we need better binding or affinity tools in Linux to
> solve some of the current problems. This is a bigger long term issue.

taskset works fine.
A GUI to "visualize" the application-to-IO path would be helpful
when doing runtime tuning of a given workload.
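For example, a minimal taskset sketch for netperf runs like the ones quoted
above (the CPU numbers and server address are illustrative, and
netperf/netserver must be installed):

```shell
# Server side: pin netserver to CPU1.
taskset -c 1 netserver

# Client side, same CPU as the completion event handler: warm cachelines.
taskset -c 1 netperf -H 10.0.0.2 -t TCP_STREAM

# Client side, a different CPU: exposes the cacheline-miss cost.
taskset -c 3 netperf -H 10.0.0.2 -t TCP_STREAM
```

`taskset -p <pid>` can also inspect or retarget an already-running process.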

> The footprint of IPoIB + TCP/IP is large, as on any system. However, as you
> get to higher CPU counts, the issue becomes less of a problem since more
> unused CPU cycles are available. On the other hand, affinity (CPU and
> memory) and cacheline miss issues get greater.

Hrm...the concept of "unused CPU cycles" is bugging me as someone
who occasionally gets to run benchmarks.  If a system today has
unused CPU cycles, then will adding a faster link change the CPU 
load if the application doesn't change?

Anyway, I don't find this a good justification for using TCP if
TCP can be avoided.

thanks,
grant


