[ofa-general] [OPENSM PATCH 0/5]: New "guid-routing-order" option for updn routing

Yiftah Shahar yiftahs at voltaire.com
Mon Jun 16 14:53:50 PDT 2008


Al, Yevgeny,

> > > I understand that this guid_routing_order_file is synchronized with
> > > an MPI rank file, right? If not, then synchronizing them might give
> > > even better results.
> >
> > Not quite sure what you mean by an MPI rank file.  At LLNL, slurm is
> > responsible for MPI ranks, so I order the guids in my file according to
> > how slurm is configured for choosing MPI ranks.  I will admit to being a
> > novice to MPI's configuration (blindly accepting slurm MPI rankings).
> > Is there an underlying file that MPI libs use for ranking knowledge?
> 
> I spoke to one of our MPI guys.  I wasn't aware that in some MPIs you
> can input a file to tell it how ranks should be assigned to nodes for
> MPI.  I assume that's what you're talking about?
> 
> Al
The upcoming Open MPI 1.3 will have such rank-placement capabilities: a rank
can be placed on a specific node and a specific CPU.  It will also have
settings that control how ranks communicate through the different HCAs in a
multi-HCA node (we have had these capabilities in VLT-MPI for more than 2
years now, but it is going into its EOL stage...).
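
For illustration, a rank placement file in the planned Open MPI 1.3 syntax
would look roughly like the sketch below (the hostnames and slot numbers are
made up, and the exact syntax may still change before the release); it is
given to mpirun through its rankfile option:

  rank 0=node001 slot=0
  rank 1=node001 slot=1
  rank 2=node002 slot=0
  rank 3=node002 slot=1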

I think that more important than rank placement is the communication pattern
(i.e. some ranks communicate a lot and some do not send a single message),
and that is far more complicated to do.

Yiftah


> -----Original Message-----
> From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Al Chu
> Sent: Monday, June 16, 2008 23:09
> To: kliteyn at dev.mellanox.co.il
> Cc: OpenIB
> Subject: Re: [ofa-general] [OPENSM PATCH 0/5]: New "guid-routing-order" option for updn routing
> 
> On Mon, 2008-06-16 at 10:21 -0700, Al Chu wrote:
> > Hey Yevgeny,
> >
> > On Sun, 2008-06-15 at 11:17 +0300, Yevgeny Kliteynik wrote:
> > > Hi Al,
> > >
> > > Al Chu wrote:
> > > > Hey Sasha,
> > > >
> > > > This is a conceptually simple option I've developed for updn
> > > > routing.
> > > >
> > > > Currently in updn routing, nodes/guids are routed on switches in a
> > > > seemingly-random order, which I believe is due to internal data
> > > > structure organization (i.e. cl_qmap_apply_func is called on
> > > > port_guid_tbl) as well as how the fabric is scanned (it is logically
> > > > scanned from a port perspective, but it may not be logical from a
> > > > node perspective).  I had a hypothesis that this was leading to
> > > > increased contention in the network for MPI.
> > > >
> > > > For example, suppose we have 12 uplinks from a leaf switch to a spine
> > > > switch.  If we want to send data from this leaf switch to node[13-24],
> > > > the uplinks we will send on are pretty random.  It's because:
> > > >
> > > > A) node[13-24] are individually routed at seemingly-random points
> > > > based on when they are called by cl_qmap_apply_func().
> > > >
> > > > B) the ports chosen for routing are based on least used port usage.
> > > >
> > > > C) least used port usage is based on whatever was routed earlier on.
> > > >
> > > > So I developed this patch series, which supports an option called
> > > > "guid_routing_order_file" which allows the user to input a file with
> > > > a list of port_guids which will indicate the order in which guids are
> > > > routed instead (naturally, those guids not listed are routed last).
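
For reference, the file is nothing more than port GUIDs listed one per line
in the desired routing order.  A minimal sketch - the GUID values here are
invented, and the config line assumes the option keeps the name used in this
patch series:

  # in the opensm options file
  guid_routing_order_file /etc/opensm/guid_routing_order.conf

  # /etc/opensm/guid_routing_order.conf - node0's port GUID first
  0x0002c90200000001
  0x0002c90200000002
  0x0002c90200000003
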
> > >
> > > Great idea!
> >
> > Thanks.
> >
> > > I understand that this guid_routing_order_file is synchronized with
> > > an MPI rank file, right? If not, then synchronizing them might give
> > > even better results.
> >
> > Not quite sure what you mean by an MPI rank file.  At LLNL, slurm is
> > responsible for MPI ranks, so I order the guids in my file according to
> > how slurm is configured for choosing MPI ranks.  I will admit to being a
> > novice to MPI's configuration (blindly accepting slurm MPI rankings).
> > Is there an underlying file that MPI libs use for ranking knowledge?
> 
> I spoke to one of our MPI guys.  I wasn't aware that in some MPIs you
> can input a file to tell it how ranks should be assigned to nodes for
> MPI.  I assume that's what you're talking about?
> 
> Al
> 
> > > Another idea: OpenSM can create such a file (a list, it doesn't have
> > > to be an actual file) automatically, just by checking
> > > topologically-adjacent leaf switches and their HCAs.
> >
> > Definitely a good idea.  This patch set was just a "step one" kind of
> > thing.
> >
> > >
> > > > I list the port guids of the nodes of the cluster from node0 to
> > > > nodeN, one per line in the file.  By listing the nodes in this order,
> > > > I believe we could get less contention in the network.  In the example
> > > > above, sending to node[13-24] should use all of the 12 uplinks, b/c
> > > > the ports will be equally used b/c nodes[1-12] were routed beforehand
> > > > in order.
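
A trivial way to build such a file is to map an ordered node list - for
example, the order in which slurm hands out ranks - onto each node's port
GUID.  A hypothetical helper sketch, not part of this patch series; it
assumes you already have a "node guid" mapping dumped from the fabric:

  #!/usr/bin/env python
  # make_guid_order.py <node_order_file> <node_to_guid_map>
  # Writes port GUIDs one per line, in node order, suitable as input
  # for the proposed guid_routing_order_file option.
  import sys

  order_file, map_file = sys.argv[1], sys.argv[2]

  # map file format assumed: "<node name> <port guid>" per line
  guid_by_node = {}
  for line in open(map_file):
      fields = line.split()
      if len(fields) >= 2:
          guid_by_node[fields[0]] = fields[1]

  # emit GUIDs in the order the nodes appear in the order file
  for line in open(order_file):
      node = line.strip()
      if node in guid_by_node:
          sys.stdout.write(guid_by_node[node] + "\n")
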
> > > >
> > > > The results from some tests are pretty impressive when I do this.
> > > > LMC=0 average bandwidth in mpiGraph goes from 391.374 MB/s to
> > > > 573.678 MB/s when I use guid_routing_order.
> > >
> > > Can you compare this to the fat-tree routing?  Conceptually, fat-tree
> > > is doing the same - it routes LIDs on nodes in a topological order, so
> > > it would be interesting to see the comparison.
> >
> > Actually I already did :-).  w/ LMC=0.
> >
> > updn default - 391.374 MB/s
> > updn w/ guid_routing_order - 573.678 MB/s
> > ftree - 579.603 MB/s
> >
> > I later discovered that one of the internal ports of the cluster I'm
> > testing on was broken (sLB of a 288 port), and think that is the cause
> > of some of the slowdown w/ updn w/ guid_routing_order.  So ftree (as
> > designed) seemed to be able to work around it properly, while updn (as
> > currently implemented) couldn't.
> >
> > When we turn on LMC > 0, mpi libraries that are LMC > 0 aware were able
> > to do better on some tests than ftree.  One example (I think these
> > numbers are in microseconds.  Lower is better):
> >
> > Alltoall 16K packets
> > ftree - 415490.6919
> > updn normal (LMC=0) - 495460.5526
> > updn w/ ordered routing (LMC=0) - 416562.7417
> > updn w/ ordered routing (LMC=1) - 453153.7289
> >  - this ^^^ result is quite odd.  Not sure why.
> > updn w/ ordered routing (LMC=2) - 3660132.1530
> >
> > We are regularly debating what will be better overall at the end of the
> > day.
> >
> > > Also, fat-tree produces the guid order file automatically, but nobody
> > > has used it yet as an input to produce an MPI rank file.
> >
> > I didn't know about this option.  How do you do this (just skimmed the
> > manpage, didn't see anything)?  I know about the --cn_guid_file.  But
> > since that file doesn't have to be ordered, that's why I created a
> > different option (rather than have the cn_guid_file for both ftree and
> > updn).
> >
> > Al
> >
> > > -- Yevgeny
> > >
> > > > A variety of other positive performance increases were found when
> > > > doing other tests, other MPIs, and other LMCs if anyone is
> > > > interested.
> > > >
> > > > BTW, I developed this patch series before your preserve-base-lid
> > > > patch series.  It will 100% conflict with the preserve-base-lid patch
> > > > series.  I will fix this patch series once the preserve-base-lid
> > > > patch series is committed to git.  I'm just looking for comments
> > > > right now.
> > > >
> > > > Al
> > > >
> > >
> --
> Albert Chu
> chu11 at llnl.gov
> 925-422-5311
> Computer Scientist
> High Performance Systems Division
> Lawrence Livermore National Laboratory
> 
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


