[openib-general] Re: Speeding up IPoIB.

Bernard King-Smith wombat2 at us.ibm.com
Thu Apr 20 18:03:29 PDT 2006


Grant Grundler wrote:

> Currently we only get 40% of the link bandwidth compared to
> 85% for 10 GigE. (Yes I know the cost differences which favor IB ).

Grant> 10gige is getting 85% without TOE?
Grant> Or are they distributing event handling across several CPUs?

On 10 GigE they are using large send, where the adapter reads a 60K
buffer and fragments it into 1500 or 9000 byte Ethernet packets.
Essentially, fragmentation into Ethernet packets is offloaded from TCP
to the adapter. This is similar to RC mode in IB, where larger buffers
are fragmented into 2000 byte link-level frames/packets.
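
To make the host-side effect concrete, here is a minimal sketch (the
peer address and port are made-up placeholders, error handling is
trimmed): the application hands a single 60K buffer to one send() call,
and with large send enabled the adapter does the cutting into wire-size
frames.

    /* Sketch: the application hands the stack one 60K buffer in a single
     * send(). With large send (TSO/LSO) enabled, the adapter, not the
     * host stack, cuts it into 1500 or 9000 byte Ethernet frames, so the
     * host pays the per-packet TCP/IP cost roughly once per large buffer.
     * The peer address/port are placeholders; error handling is trimmed. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[60 * 1024];              /* 60K application buffer */
        struct sockaddr_in peer;
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        memset(buf, 0, sizeof(buf));
        memset(&peer, 0, sizeof(peer));
        peer.sin_family = AF_INET;
        peer.sin_port = htons(5001);
        inet_pton(AF_INET, "192.168.1.10", &peer.sin_addr);

        connect(fd, (struct sockaddr *)&peer, sizeof(peer));
        send(fd, buf, sizeof(buf), 0);    /* one call, adapter segments it */
        close(fd);
        return 0;
    }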

> However, two things hurt user level protocols. First is scaling and
> memory requirements. Looking at parallel file systems on large
> clusters, SDP ended up consuming so much memory it couldn't be used.
> With N by N socket connections per node, the buffer space and QP
> memory required by SDP got out of control. There is something to be
> said for sharing buffer and QP space across lots of sockets.

Grant> My guess is it's an easier problem to fix SDP than reducing TCP/IP
Grant> cache/CPU foot print. I realize only a subset of apps can (or will
Grant> try to) use SDP because of setup/config issues.  I still believe SDP
Grant> is useful to a majority of apps without having to recompile them.
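
To put rough numbers on the N by N scaling above, here is a
back-of-envelope sketch (the node count and per-connection footprint
are assumptions for illustration, not measurements):

    /* Back-of-envelope: per-node SDP buffer/QP memory for an all-to-all
     * (N by N) socket pattern. Node count and per-connection footprint
     * are assumed values for illustration only. */
    #include <stdio.h>

    int main(void)
    {
        long nodes          = 1024;         /* assumed cluster size */
        long conns_per_node = nodes - 1;    /* one socket to every peer */
        long per_conn_bytes = 256 * 1024;   /* assumed buffers + QP state */
        long per_node_bytes = conns_per_node * per_conn_bytes;

        printf("per-node SDP memory: %ld MB\n",
               per_node_bytes / (1024 * 1024));
        return 0;
    }

At a few hundred MB per node just for per-socket buffers and QP state,
it is easy to see why this fell over on large clusters.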

I agree that reducing any protocol footprint is a very challenging job;
however, going to a larger MTU drops the overhead much faster. If IB
supported a 60K MTU, then the TCP/IP overhead would be about 1/30 of
what we measure today. Traversing the TCP/IP stack once for a 60K packet
costs far less than traversing it 30 times with 2000 byte packets for
the same amount of data transmitted.
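
The arithmetic behind that 1/30 figure, assuming per-packet stack cost
dominates, is simply the packet count for the same payload:

    /* Sketch of the per-packet overhead argument: the stack is traversed
     * once per packet, so packet count is a rough proxy for TCP/IP cost. */
    #include <stdio.h>

    int main(void)
    {
        long payload   = 60 * 1000;   /* bytes handed to the stack */
        long mtu_ipoib = 2000;        /* roughly today's IPoIB packet size */
        long mtu_large = 60 * 1000;   /* hypothetical 60K MTU */

        printf("traversals at 2000 bytes: %ld\n", payload / mtu_ipoib); /* 30 */
        printf("traversals at 60K:        %ld\n", payload / mtu_large); /*  1 */
        return 0;
    }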

> The other issue is flow control across hundreds of autonomous sockets.
> In TCP/IP, traffic can be managed so that there is some fairness
> (multiplexing, QoS etc.) across all active sockets. For user level
> protocols like SDP and uDAPL, you can't manage traffic across multiple
> autonomous user application connections because there is nowhere to
> see all of them at the same time for management. This can lead to
> overrunning adapters or timeouts to the applications. This tends to be
> a large system problem when you have lots of CPUs.

Grant> I'm not competent to disagree in detail.
Grant> Fabian Tillier and Caitlin Bestler can (and have) addressed this.

I would be very interested in any pointers to their work.

> The footprint of IPoIB + TCP/IP is large, as on any system. However,
> as you get to higher CPU counts, the issue becomes less of a problem
> since more unused CPU cycles are available. At the same time, affinity
> (CPU and memory) and cacheline miss issues get greater.

Grant> Hrm...the concept of "unused CPU cycles" is bugging me as someone
Grant> who occasionally gets to run benchmarks.  If a system today has
Grant> unused CPU cycles, then will adding a faster link change the CPU
Grant> load if the application doesn't change?

This goes back to systems that are busy doing nothing, generally while
waiting on memory for a cache line miss, or for I/O to disks. This is
where hyperthreading has shown some speedups for benchmarks that
previously were totally CPU limited. The unused cycles are "wait" cycles
during which something else can run if it can get in quickly. You can't
fit a whole TCP stack into the wait, but small parts of the stack or
driver could fit in the other thread. Yes, I do benchmarking and was
skeptical at first.

Grant> thanks,
Grant> grant

Bernie King-Smith
IBM Corporation
Server Group
Cluster System Performance
wombat2 at us.ibm.com    (845)433-8483
Tie. 293-8483 or wombat2 on NOTES

"We are not responsible for the world we are born into, only for the world
we leave when we die.
So we have to accept what has gone before us and work to change the only
thing we can,
-- The Future." William Shatner



