[openib-general] Re: openib-general Digest, Vol 22, Issue 114

Bernard King-Smith wombat2 at us.ibm.com
Tue Apr 18 13:48:28 PDT 2006


    Shirley> Some tests have been done over mthca and
    Shirley> ehca. Unidirectional stream test, gains up to 15%
    Shirley> throughput with this patch on systems over 4 cpus.
    Shirley> Bidirectional could gain more. People might get different
    Shirley> performance improvement numbers under different drivers
    Shirley> and cpus. I have attached the patch for those who are willing
    Shirley> to run the performance test with different drivers. And
    Shirley> please give your input.

Roland> Have you ever seen this hurt performance?  It seems that splitting
Roland> receives and send CQs will increase the number of events generated
Roland> and possibly use more CPU.

The problem occurs when you exceed the performance of a single CPU. We have
been running on multiple-CPU systems, and this change actually helps
performance on 2 CPUs running 4 hyperthreads using 2 sockets, one socket for
sending and one socket for receiving. If you look at recent IP performance
using IPoIB, you see that exchange (bidirectional) bandwidth is not much
faster than unidirectional (using Netperf).

Roland> Actually, do you have some explanation for why this helps performance?
Roland> My intuition would be that it just generates more interrupts for the
Roland> same workload.

On a multiple-CPU system, top shows one process consuming a full CPU. This
happens to be the thread handling completion queue entries. I suggested that
we look at separate threads handling send completions vs. receive
completions. When we ran with the split completion queue patch, we no longer
saw one process pegging a CPU at 100%, and we got a speedup of 65% going from
STREAM to Duplex. Without the split completion queue, we only saw a 15%
speedup going from STREAM to Duplex.
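
To put those two percentages side by side, here is a rough sketch of the
arithmetic. The function name and the idea of normalizing STREAM to 1.0 are
just for illustration; only the 15% and 65% figures come from our runs. The
last step assumes the STREAM baseline is the same with and without the patch,
which may not hold exactly.

```python
def duplex_relative_to_stream(speedup_percent):
    """Return Duplex bandwidth as a multiple of STREAM bandwidth.

    Hypothetical helper: STREAM is normalized to 1.0.
    """
    return 1.0 + speedup_percent / 100.0


# Speedups quoted above, going from STREAM to Duplex.
single_cq = duplex_relative_to_stream(15)  # one shared completion queue
split_cq = duplex_relative_to_stream(65)   # split send/receive CQs

# Extra Duplex bandwidth with the split CQ, ASSUMING the same STREAM
# baseline in both runs (an assumption, not a measurement).
gain = split_cq / single_cq - 1.0

print(f"single CQ: {single_cq:.2f}x STREAM")
print(f"split CQ:  {split_cq:.2f}x STREAM")
print(f"relative Duplex gain: {gain:.1%}")
```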

Overall CPU utilization does increase with split completion queue handling,
but it rises in proportion to the added bandwidth: the CPU cost per MB/s is
no higher than with a single handler.
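
As a sketch of what "per MB/s" means here: divide total CPU utilization by
the bandwidth achieved and compare the two configurations. The numbers below
are made up purely to show the calculation and are not measurements from our
runs; the function name is hypothetical as well.

```python
def cpu_per_mbps(cpu_percent, bandwidth_mbps):
    """CPU utilization (percent of one CPU) spent per MB/s of bandwidth."""
    return cpu_percent / bandwidth_mbps


# HYPOTHETICAL numbers, chosen only to illustrate the normalization.
single_handler = cpu_per_mbps(cpu_percent=100.0, bandwidth_mbps=500.0)
split_handlers = cpu_per_mbps(cpu_percent=140.0, bandwidth_mbps=700.0)

# Total CPU went up (100 -> 140), but so did bandwidth (500 -> 700),
# so the cost per MB/s is unchanged in this made-up case.
print(single_handler, split_handlers)
```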

You probably won't see any improvement in performance on 1- or 2-CPU systems
because you are already out of CPU at these bandwidths. However, on
machines with 4 or more CPUs, either hyperthreads or physical cores, you
will see the benefit in duplex bandwidth.


Bernie King-Smith
IBM Corporation
Server Group
Cluster System Performance
wombat2 at us.ibm.com    (845)433-8483
Tie. 293-8483 or wombat2 on NOTES

"We are not responsible for the world we are born into, only for the world
we leave when we die.
So we have to accept what has gone before us and work to change the only
thing we can,
-- The Future." William Shatner



