[ofa-general] [OpenSM] [PATCH 0/3] New "port-offsetting" option to updn/minhop routing

Al Chu chu11 at llnl.gov
Thu Apr 10 14:10:15 PDT 2008


Hey Sasha,

I was going to submit this after I had a chance to test on one of our
big clusters to see if it worked 100% right.  But my final testing has
been delayed (for a month now!).  Ira said some folks from Sonoma were
interested in this, so I'll go ahead and post it.

This is a patch for something I call "port_offsetting" (the
name/description of the option is open to suggestion).  Basically, we
want to move to using LMC > 0 on our clusters because some of the
newer MPI implementations take advantage of multiple lids and have
shown faster performance when LMC > 0.

The problem is that users who do not use the newer MPI
implementations, or do not run their code in a way that can take
advantage of multiple lids, suffer a large performance degradation.
We determined that the primary issue is what we started calling "base
lid alignment".  Here's a simple example.

Assume LMC = 2, so each port has 2^2 = 4 lids, and we are trying to
route the lids of 4 ports (A, B, C, D).  Those lids are:

port A - 1,2,3,4
port B - 5,6,7,8
port C - 9,10,11,12
port D - 13,14,15,16
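
For reference, LMC = n gives each port 2^n consecutive lids starting
at its base lid.  Here is a minimal standalone C sketch (illustration
only, not OpenSM code) that prints the table above:

  #include <stdio.h>

  int main(void)
  {
          const unsigned lmc = 2;
          const unsigned lids_per_port = 1u << lmc; /* 2^LMC = 4 */
          unsigned port, i, base_lid;

          for (port = 0; port < 4; port++) {
                  base_lid = 1 + port * lids_per_port;
                  printf("port %c - ", (int)('A' + port));
                  for (i = 0; i < lids_per_port; i++)
                          printf("%u%s", base_lid + i,
                                 i + 1 < lids_per_port ? "," : "\n");
          }
          return 0;
  }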

Suppose forwarding of these lids goes through 4 switch ports.  If we
cycle through the switch ports the way updn/minhop currently do, we
would see something like this:

switch port 1: 1, 5, 9, 13
switch port 2: 2, 6, 10, 14
switch port 3: 3, 7, 11, 15
switch port 4: 4, 8, 12, 16
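
In effect, the switch port chosen depends only on the lid's offset
within its 2^LMC block, roughly like this (a simplified model of the
current behavior, not the actual updn/minhop code):

  /* simplified model of the current behavior: the chosen switch
   * port depends only on the lid's offset within its 2^LMC block */
  static unsigned pick_switch_port(unsigned lid, unsigned base_lid,
                                   unsigned num_switch_ports)
  {
          /* lid 1 (base 1) -> port 1, lid 5 (base 5) -> port 1 */
          return (lid - base_lid) % num_switch_ports + 1;
  }

so every base lid (offset 0) lands on the same switch port.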

Note that the base lid of each port (lids 1, 5, 9, 13) goes through
only 1 port of the switch.  Thus a user who uses only the base lid is
using only 1 of the 4 switch ports they could be using, leading to
terrible performance.

We want to get this instead:

switch port 1: 1, 8, 11, 14
switch port 2: 2, 5, 12, 15
switch port 3: 3, 6, 9,  16
switch port 4: 4, 7, 10, 13

where the base lids (1, 5, 9, 13) are distributed evenly across the
switch ports.

To do this, we (effectively) iterate through the switch ports as
before, but we start the iteration at a different index depending on
the number of paths we have routed thus far.
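
Roughly, as a standalone sketch (this helper is hypothetical; the
real patch does this inside the updn/minhop path selection, and
paths_routed stands in for whatever counter the code actually
maintains):

  /* hypothetical sketch of port offsetting: start the round-robin
   * at an index derived from the number of paths routed so far, so
   * consecutive base lids land on different switch ports */
  static unsigned pick_switch_port_offset(unsigned lid,
                                          unsigned base_lid,
                                          unsigned paths_routed,
                                          unsigned num_switch_ports)
  {
          unsigned start = paths_routed % num_switch_ports;

          return ((lid - base_lid) + start) % num_switch_ports + 1;
  }

With paths_routed advancing by one for each CA port routed (0 for A,
1 for B, and so on), this reproduces the table above: port B's lids
5,6,7,8 map to switch ports 2,3,4,1, and the base lids 1, 5, 9, 13
spread across all four switch ports.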

On one of our clusters, testing has shown that when we run with LMC=1
and 1 task per node, mpibench (AlltoAll tests) results range from
10-30% worse than when LMC=0 is used.  With LMC=2, mpibench tends to
perform 50-70% worse than with LMC=0.

With the port offsetting option, the performance degradation drops to
1-5% worse than LMC=0.  I am currently at a loss as to why I cannot
get it even with LMC=0, but 1-5% is small enough to not make users
mad :-)

The part I haven't been able to test yet is whether the newer MPIs
that do take advantage of LMC > 0 perform equally well with my
port_offsetting turned off and on.

Thanks, I look forward to your comments,

Al


-- 
Albert Chu
chu11 at llnl.gov
925-422-5311
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory



