[openib-general] Speeding up IPoIB.

Bernard King-Smith wombat2 at us.ibm.com
Wed Apr 19 12:10:29 PDT 2006


[sorry if this forum is the wrong place to take this up]

Grant Grundler <iod00d at hp.com> wrote:

Grant> [ I've probably posted some of these results before...here's another
Grant> take on this problem. ]

Hopefully not rehashing too much old information.

Grant> I expect splitting the RX/TX completions would achieve something
Grant> similar since we are just "slicing" the same problem from a
Grant> different angle.  Apps typically do both RX and TX and will be
Grant> running on one CPU. So on one path they will be missing cachelines.

However, the event handler(s) handling the RX/TX completions are not
guaranteed to run on the same CPU as the application unless you have the
scheduler enforce some kind of affinity between the application and the
event handler for the completion queue. In addition, if an application has
multiple sockets, then the event handlers are all over the place because
each socket has its own completion queue. Does one event handler handle all
completion queues?
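
For reference, here is a minimal user-space sketch (libibverbs, not the
kernel IPoIB driver itself) of what splitting RX and TX completions onto
separate queues and event vectors could look like. The queue depths and
vector numbers are illustrative assumptions, and the vectors would still
have to be pinned to CPUs through the usual IRQ affinity knobs:

/* Sketch: separate RX and TX completion queues on distinct completion
 * channels/vectors so their event handling can be serviced and pinned
 * independently.  Sizes and vector numbers are assumptions. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct ibv_device **dev_list = ibv_get_device_list(NULL);
    if (!dev_list || !dev_list[0]) {
        fprintf(stderr, "no IB devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(dev_list[0]);
    if (!ctx)
        return 1;

    /* One completion channel per direction. */
    struct ibv_comp_channel *rx_ch = ibv_create_comp_channel(ctx);
    struct ibv_comp_channel *tx_ch = ibv_create_comp_channel(ctx);

    /* comp_vector selects which interrupt vector delivers the CQ events;
     * vectors 0 and 1 here are purely illustrative and can be bound to
     * different CPUs via /proc/irq/<n>/smp_affinity. */
    struct ibv_cq *rx_cq = rx_ch ? ibv_create_cq(ctx, 256, NULL, rx_ch, 0) : NULL;
    struct ibv_cq *tx_cq = tx_ch ? ibv_create_cq(ctx, 256, NULL, tx_ch, 1) : NULL;

    if (!rx_cq || !tx_cq) {
        fprintf(stderr, "CQ/channel setup failed\n");
        return 1;
    }

    /* ... a QP would then be created with send_cq = tx_cq, recv_cq = rx_cq ... */

    ibv_destroy_cq(tx_cq);
    ibv_destroy_cq(rx_cq);
    ibv_destroy_comp_channel(tx_ch);
    ibv_destroy_comp_channel(rx_ch);
    ibv_close_device(ctx);
    ibv_free_device_list(dev_list);
    return 0;
}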

Grant> Anyway, my take is IPoIB perf isn't as critical as SDP and RDMA perf.
Grant> If folks really care about perf, they have to migrate away from
Grant> IPoIB to either SDP or directly use RDMA (uDAPL or something).
Grant> Splitting RX/TX completions might help initial adoption, but
Grant> that isn't where the big wins in perf are.

My take is, good enough is not good enough. If the cost to move from IP to
SDP or RDMA is too great, then applications (particularly in the commercial
sector) will not convert. Hence, if IPoIB is too slow they will go with
Ethernet. Currently we only get 40% of the link bandwidth with IPoIB,
compared to 85% for 10 GigE. (Yes, I know the cost differences favor IB.)

However, two things hurt user-level protocols. The first is scaling and
memory requirements. Looking at parallel file systems on large clusters,
SDP ended up consuming so much memory it couldn't be used. With N-by-N
socket connections per node, the buffer space and QP memory SDP required
got out of control. There is something to be said for sharing buffer and QP
space across lots of sockets.
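
To make the scaling concrete, here is a back-of-the-envelope sketch. The
cluster size, buffer sizes, and QP state figures are assumptions for
illustration only, not measured SDP numbers:

/* Why per-socket buffering blows up on an N-node all-to-all workload. */
#include <stdio.h>

int main(void)
{
    int nodes = 512;                            /* cluster size (assumed)        */
    int peers = nodes - 1;                      /* sockets per node, all-to-all  */
    long long per_conn_bytes = 2LL * 128 * 1024 /* send + recv buffers (assumed) */
                             + 64 * 1024;       /* QP/WR state (assumed)         */

    long long per_node = (long long)peers * per_conn_bytes;
    printf("per-node pinned memory: %lld MB\n", per_node >> 20);
    printf("cluster-wide:           %lld GB\n", ((long long)nodes * per_node) >> 30);
    return 0;
}

Even with modest per-connection buffers, the all-to-all pattern pins on the
order of hundreds of megabytes per node, which is exactly where sharing
buffer and QP space across sockets would help.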

The other issue is flow control across hundreds of autonomous sockets. In
TCP/IP, traffic can be managed so that there is some fairness
(multiplexing, QoS, etc.) across all active sockets. For user-level
protocols like SDP and uDAPL, you can't manage traffic across multiple
autonomous user application connections because there is nowhere to see all
of them at the same time for management. This can lead to overrunning
adapters or timeouts back to the applications. It tends to be a large-system
problem when you have lots of CPUs.

SDP and uDAPL have some good ideas but have a way to go for anything except
HPC and workloads that are not expected to scale to large configurations.
For HPC you can use MPI for application message passing, but for the rest
of the cluster traffic you need a well-performing IP implementation for
now. With time things can improve. There is also IPoIB-CM for much lower
IPoIB overhead.

Grant> Pinning netperf/netserver to a different CPU caused SDP perf
Grant> to drop from 5.5 Gb/s to 5.4 Gb/s. Service Demand went from
Grant> around 0.55 usec/KB to 0.56 usec/KB. ie a much smaller impact
Grant> on cacheline misses.

I agree cacheline misses are something that has to be watched carefully.
For some platforms we need better binding or affinity tools in Linux to
solve some of the current problems. This is a bigger, long-term issue.
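
As an example of the kind of binding I mean, here is a minimal sketch using
the existing sched_setaffinity() interface to pin a process (say, a
netperf/netserver instance) to one CPU. The CPU number is an assumption and
would have to match the CPU servicing the HCA's interrupt:

/* Pin the current process to one CPU so it shares cachelines with the
 * interrupt/event handler running there. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;
    int cpu = 2;                 /* assumed: the CPU taking the HCA interrupt */

    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);

    /* pid 0 == bind the calling process */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    printf("bound to CPU %d; run the benchmark from here\n", cpu);
    return 0;
}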

Grant> Keeping traffic local to the CPU that's taking the interrupt
Grant> keeps the cachelines local. I don't want to discourage anyone
Grant> from their pet projects. But the conclusion I drew from the
Grant> above data is IPoIB is a good compatibility story but cacheline
Grant> misses are going to make it hard to improve perf regardless
Grant> of how we divide the workload. IPoIB + TCP/IP code path just has
Grant> a big footprint.


The footprint of IPoIB + TCP/IP is large on any system. However, as you
get to higher CPU counts, the issue becomes less of a problem since more
unused CPU cycles are available. At the same time, the affinity (CPU and
memory) and cacheline-miss issues get worse.


Bernie King-Smith
IBM Corporation
Server Group
Cluster System Performance
wombat2 at us.ibm.com    (845)433-8483
Tie. 293-8483 or wombat2 on NOTES

"We are not responsible for the world we are born into, only for the world
we leave when we die.
So we have to accept what has gone before us and work to change the only
thing we can,
-- The Future." William Shatner




