[ofa-general] [OpenSM] [PATCH 0/3] New "port-offsetting" option to updn/minhop routing

Al Chu chu11 at llnl.gov
Wed May 28 19:30:52 PDT 2008


Oops, I forgot about one other important measurement we did.  The
following are the average send/receive MPI bandwidths as measured by
mpigraph (http://sourceforge.net/projects/mpigraph), again using updn
routing:

LMC=0  Send 391 MB/s  Recv 461 MB/s
LMC=1  Send 292 MB/s  Recv 358 MB/s
LMC=2  Send 197 MB/s  Recv 241 MB/s

With my port offsetting turned on, I got:

LMC=1  Send 387 MB/s  Recv 457 MB/s
LMC=2  Send 383 MB/s  Recv 455 MB/s

So similar to the AlltoAll MPI tests, the port offsetting gets the
numbers back to about what they were at LMC=0.

Al

On Wed, 2008-05-28 at 17:14 -0700, Al Chu wrote:
> Hey Sasha,
> 
> Attached are some numbers from a recent run I did with my port
> offsetting patches.  I ran with mvapich 0.9.9 and OpenMPI 1.2.6 on 120
> nodes, with either 1 task per node or 8 tasks per node (the nodes have
> 8 processors each), trying LMC=0, LMC=1, and LMC=2 with the original
> 'updn', then LMC=1 and LMC=2 with my port-offsetting patch (labeled
> "PO").  Next to these columns are the percentages showing how much
> worse each result is compared to LMC=0.  My understanding is that
> mvapich 0.9.9 does not know how to take advantage of multiple lids,
> while OpenMPI 1.2.6 does.
> 
> I think the key numbers to notice are that without port-offsetting,
> performance relative to LMC=0 is pretty bad when the MPI implementation
> does not know how to take advantage of multiple lids (mvapich 0.9.9).
> LMC=1 shows ~30% performance degradation and LMC=2 shows ~90%
> degradation on this cluster.  With port-offsetting turned on, the
> degradation falls to 0%-6%, and a few runs are even faster.  We
> consider this within "noise" levels.
> 
> For MPIs that do know how to take advantage of multiple lids, it seems
> the port-offsetting patch doesn't affect performance much (see the
> OpenMPI 1.2.6 sections).
> 
> Please let me know what you think.  Thanks.
> 
> Al
> 
> On Thu, 2008-04-10 at 14:10 -0700, Al Chu wrote:
> > Hey Sasha,
> > 
> > I was going to submit this after I had a chance to test on one of our
> > big clusters to see if it worked 100% right.  But my final testing has
> > been delayed (for a month now!).  Ira said some folks from Sonoma were
> > interested in this, so I'll go ahead and post it.
> > 
> > This is a patch for something I call "port_offsetting" (name/description
> > of the option is open to suggestion).  Basically, we want to move to
> > using lmc > 0 on our clusters because some of the newer MPI implementations
> > take advantage of multiple lids and have shown faster performance when
> > lmc > 0.
> > 
> > The problem is that users who do not use the newer MPI
> > implementations, or do not run their code in a way that can take
> > advantage of multiple lids, suffer severe performance degradation.  We
> > determined that the primary issue is what we started calling "base lid
> > alignment".  Here's a simple example.
> > 
> > Assume LMC = 2, so each port gets 2^2 = 4 lids, and we are trying to
> > route the lids of 4 ports (A,B,C,D).  Those lids are:
> > 
> > port A - 1,2,3,4
> > port B - 5,6,7,8
> > port C - 9,10,11,12
> > port D - 13,14,15,16
> > 
> > Suppose forwarding of these lids goes through 4 switch ports.  If we
> > cycle through the ports like updn/minhop currently do, we would see
> > something like this.
> > 
> > switch port 1: 1, 5, 9, 13
> > switch port 2: 2, 6, 10, 14
> > switch port 3: 3, 7, 11, 15
> > switch port 4: 4, 8, 12, 16
> > 
> > Note that the base lid of each port (lids 1, 5, 9, 13) goes through only
> > 1 port of the switch.  Thus a user that uses only the base lids is using
> > only 1 port out of the 4 switch ports they could be using, leading to
> > terrible performance.
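> > 
> > Just to make the arithmetic concrete, here is a toy sketch of that
> > assignment (illustrative only; the names are made up and this is not
> > the actual opensm routing code):
> > 
> > #include <stdio.h>
> > 
> > int main(void)
> > {
> > 	/* LMC = 2 -> 4 lids per end port; 4 end ports; 4 switch ports */
> > 	const unsigned lmc = 2, num_end_ports = 4, num_switch_ports = 4;
> > 	const unsigned lids_per_port = 1u << lmc;
> > 	unsigned p, off;
> > 
> > 	for (p = 0; p < num_end_ports; p++) {
> > 		unsigned base_lid = 1 + p * lids_per_port;
> > 		for (off = 0; off < lids_per_port; off++) {
> > 			/* cycle on the lid offset alone; every base lid
> > 			 * (off == 0) lands on switch port 1 */
> > 			unsigned sw_port = off % num_switch_ports;
> > 			printf("lid %2u -> switch port %u\n",
> > 			       base_lid + off, sw_port + 1);
> > 		}
> > 	}
> > 	return 0;
> > }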
> > 
> > We want to get this instead.
> > 
> > switch port 1: 1, 8, 11, 14
> > switch port 2: 2, 5, 12, 15
> > switch port 3: 3, 6, 9,  16
> > switch port 4: 4, 7, 10, 13
> > 
> > where base lids are distributed in a more even manner.
> > 
> > In order to do this, we (effectively) iterate through the switch ports
> > as before, but start at a different index depending on the number of
> > paths we have routed thus far.
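> > 
> > In terms of the toy sketch above, that amounts to shifting the start
> > index by the number of end ports already routed (again, just a sketch
> > of the idea, not the actual patch):
> > 
> > 	for (p = 0; p < num_end_ports; p++) {
> > 		unsigned base_lid = 1 + p * lids_per_port;
> > 		for (off = 0; off < lids_per_port; off++) {
> > 			/* start the cycle at p, the number of end ports
> > 			 * routed so far */
> > 			unsigned sw_port = (off + p) % num_switch_ports;
> > 			printf("lid %2u -> switch port %u\n",
> > 			       base_lid + off, sw_port + 1);
> > 		}
> > 	}
> > 
> > With the same constants this prints the second table above: base lids
> > 1, 5, 9, and 13 now land on switch ports 1, 2, 3, and 4 respectively.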
> > 
> > On one of our clusters, some testing has shown that when we run with
> > LMC=1 and 1 task per node, the mpibench AlltoAll tests range from
> > 10-30% worse than when LMC=0 is used.  With LMC=2, mpibench tends to be
> > 50-70% worse than with LMC=0.
> > 
> > With the port offsetting option, the performance degradation is only
> > 1-5% worse than LMC=0.  I am currently at a loss as to why I cannot get
> > it even with LMC=0, but 1-5% is small enough not to make users mad :-)
> > 
> > The part I haven't been able to test yet is whether newer MPIs that do
> > take advantage of LMC > 0 perform equally well with my port_offsetting
> > turned off and on.
> > 
> > Thanks, look forward to your comments,
> > 
> > Al
> > 
> > 
> -- 
> Albert Chu
> chu11 at llnl.gov
> 925-422-5311
> Computer Scientist
> High Performance Systems Division
> Lawrence Livermore National Laboratory
-- 
Albert Chu
chu11 at llnl.gov
925-422-5311
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory