[ofa-general] [OpenSM] [PATCH 0/3] New "port-offsetting" option to updn/minhop routing
Al Chu
chu11 at llnl.gov
Wed May 28 17:14:59 PDT 2008
Hey Sasha,
Attached are some numbers from a recent run I did with my port
offsetting patches. I ran with mvapich 0.9.9 and OpenMPI 1.2.6 on 120
nodes, using either 1 task per node or 8 tasks per node (the nodes
have 8 processors each), trying LMC=0, LMC=1, and LMC=2 with the
original 'updn', then LMC=1 and LMC=2 with my port-offsetting patch
(labeled "PO"). Next to these columns is the percentage by which each
number is worse than the LMC=0 baseline. My understanding is that
mvapich 0.9.9 does not know how to take advantage of multiple lids,
while OpenMPI 1.2.6 does.
I think the key numbers to notice are that without port-offsetting,
performance relative to LMC=0 is quite bad when the MPI implementation
does not know how to take advantage of multiple lids (mvapich 0.9.9).
LMC=1 shows ~30% performance degradation and LMC=2 shows ~90%
degradation on this cluster. With port-offsetting turned on, the
degradation falls to 0-6%, in a few cases even coming out faster. We
consider this within "noise" levels.
For MPIs that do know how to take advantage of multiple lids, the
port-offsetting patch doesn't seem to affect performance much (see the
OpenMPI 1.2.6 sections).
PLMK what you think. Thanks.
Al
On Thu, 2008-04-10 at 14:10 -0700, Al Chu wrote:
> Hey Sasha,
>
> I was going to submit this after I had a chance to test on one of our
> big clusters to see if it worked 100% right. But my final testing has
> been delayed (for a month now!). Ira said some folks from Sonoma were
> interested in this, so I'll go ahead and post it.
>
> This is a patch for something I call "port_offsetting" (the
> name/description of the option is open to suggestion). Basically, we
> want to move to using LMC > 0 on our clusters because some of the
> newer MPI implementations take advantage of multiple lids and have
> shown faster performance when LMC > 0.
>
> The problem is that users who do not use the newer MPI
> implementations, or do not run their code in a way that can take
> advantage of multiple lids, suffer severe performance degradation.
> We determined that the primary issue is what we started calling
> "base lid alignment". Here's a simple example.
>
> Assume LMC = 2, so each port is assigned 2^2 = 4 consecutive lids,
> and we are trying to route the lids of 4 end ports (A, B, C, D).
> Those lids are:
>
> port A - 1,2,3,4
> port B - 5,6,7,8
> port C - 9,10,11,12
> port D - 13,14,15,16
>
> Suppose forwarding of these lids goes through 4 switch ports. If we
> cycle through the ports like updn/minhop currently do, we would see
> something like this.
>
> switch port 1: 1, 5, 9, 13
> switch port 2: 2, 6, 10, 14
> switch port 3: 3, 7, 11, 15
> switch port 4: 4, 8, 12, 16
>
> Note that the base lid of each port (lids 1, 5, 9, 13) goes through
> only 1 port of the switch. Thus a user that uses only the base lid is
> using only 1 of the 4 switch ports they could be using, leading to
> terrible performance.
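>
> To make the mechanics concrete, here's a small standalone sketch
> (not the actual OpenSM code; the names are made up for illustration)
> that reproduces the table above. Each end port owns 2^LMC
> consecutive lids starting at its base lid, and its paths are
> assigned to switch ports in round-robin order. Because the cycle
> restarts at the same index for every end port, every base lid lands
> on switch port 1:
>
>   #include <stdio.h>
>
>   int main(void)
>   {
>       const int lmc = 2;               /* 2^2 = 4 lids per end port */
>       const int lids_per_port = 1 << lmc;
>       const int num_end_ports = 4;     /* end ports A, B, C, D */
>       const int num_switch_ports = 4;
>       int e, l;
>
>       for (e = 0; e < num_end_ports; e++) {
>           int base_lid = e * lids_per_port + 1;
>           for (l = 0; l < lids_per_port; l++) {
>               int lid = base_lid + l;
>               /* naive: the cycle restarts at 0 for every end port,
>                  so the lid offset alone picks the switch port */
>               int sw_port = (l % num_switch_ports) + 1;
>               printf("lid %2d -> switch port %d\n", lid, sw_port);
>           }
>       }
>       return 0;
>   }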
>
> We want to get this instead.
>
> switch port 1: 1, 8, 11, 14
> switch port 2: 2, 5, 12, 15
> switch port 3: 3, 6, 9, 16
> switch port 4: 4, 7, 10, 13
>
> where the base lids are distributed more evenly across the switch
> ports.
>
> In order to do this, we (effectively) iterate through all ports as
> before, but we start the iteration at a different index depending on
> the number of paths we have routed thus far.
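>
> In terms of the sketch above, the only thing that changes is where
> the cycle starts. Something like the following (again just an
> illustration; in the actual patch the starting index comes from the
> number of paths routed so far rather than a plain loop counter):
>
>       for (e = 0; e < num_end_ports; e++) {
>           int base_lid = e * lids_per_port + 1;
>           for (l = 0; l < lids_per_port; l++) {
>               int lid = base_lid + l;
>               /* offset the start of the cycle by the number of end
>                  ports routed so far, so consecutive base lids land
>                  on different switch ports */
>               int sw_port = ((l + e) % num_switch_ports) + 1;
>               printf("lid %2d -> switch port %d\n", lid, sw_port);
>           }
>       }
>
> With LMC = 2 and 4 switch ports, this yields exactly the desired
> table: base lids 1, 5, 9, and 13 now go out switch ports 1, 2, 3,
> and 4 respectively.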
>
> On one of our clusters, testing has shown that when we run with
> LMC=1 and 1 task per node, mpibench (AlltoAll tests) ranges from
> 10-30% worse than when LMC=0 is used. With LMC=2, mpibench tends to
> be 50-70% worse in performance than with LMC=0.
>
> With the port offsetting option, the performance degradation ranges
> from 1-5% worse than LMC=0. I am currently at a loss as to why I
> cannot get it even with LMC=0, but 1-5% is small enough to not make
> users mad :-)
>
> The part I haven't been able to test yet is whether the newer MPIs
> that do take advantage of LMC > 0 perform equally well with my
> port_offsetting turned off and on.
>
> Thanks, look forward to your comments,
>
> Al
>
>
--
Albert Chu
chu11 at llnl.gov
925-422-5311
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mpi_port_offsetting.xls
Type: application/vnd.ms-excel
Size: 17408 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20080528/82c4a362/attachment.xls>