[ofa-general] [OPENSM PATCH 0/5]: New "guid-routing-order" option for updn routing

Al Chu chu11 at llnl.gov
Mon Jun 16 10:21:21 PDT 2008


Hey Yevgeny,
 
On Sun, 2008-06-15 at 11:17 +0300, Yevgeny Kliteynik wrote:
> Hi Al,
> 
> Al Chu wrote:
> > Hey Sasha,
> > 
> > This is a conceptually simple option I've developed for updn routing.
> > 
> > Currently in updn routing, nodes/guids are routed on switches in a
> > seemingly-random order, which I believe is due to internal data
> > structure organization (i.e. cl_qmap_apply_func is called on
> > port_guid_tbl) as well as how the fabric is scanned (it is logically
> > scanned from a port perspective, but it may not be logical from a node
> > perspective).  I had a hypothesis that this was leading to increased
> > contention in the network for MPI.
> > 
> > For example, suppose we have 12 uplinks from a leaf switch to a spine
> > switch.  If we want to send data from this leaf switch to node[13-24],
> > the up links we will send on are pretty random. It's because:
> > 
> > A) node[13-24] are individually routed at seemingly-random points based
> > on when they are called by cl_qmap_apply_func().
> > 
> > B) the ports chosen for routing are based on least used port usage.
> > 
> > C) least used port usage is based on whatever was routed earlier on.
> > 
> > So I developed this patch series, which supports an option called
> > "guid_routing_order_file" which allows the user to input a file with a
> > list of port_guids which will indicate the order in which guids are
> > routed instead (naturally, those guids not listed are routed last).
> 
> Great idea!

Thanks.
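
For anyone following along, here is a toy model of the effect described
above. It is only a sketch, not OpenSM code: the function names, the
24-node/12-uplink sizes, and the "lowest port wins a tie" rule are all
assumptions made for illustration. Each destination is given the
least-used uplink at the moment it is routed, and we count how many
distinct uplinks node[13-24] end up on.

```python
def assign_uplinks(route_order, num_uplinks):
    """Give each destination the least-used uplink so far (ties -> lowest port)."""
    usage = [0] * num_uplinks
    assignment = {}
    for dest in route_order:
        # min() returns the first port with the smallest usage count
        port = min(range(num_uplinks), key=lambda p: usage[p])
        usage[port] += 1
        assignment[dest] = port
    return assignment


def distinct_uplinks(assignment, dests):
    """Number of different uplinks used to reach the given destinations."""
    return len({assignment[d] for d in dests})


group = range(13, 25)  # node[13-24]

# Routed in node order: node1..node24
ordered = assign_uplinks(list(range(1, 25)), 12)

# Hypothetical "seemingly random" order: node13 first, node14 before node12
interleaved_order = [13] + list(range(1, 12)) + [14, 12] + list(range(15, 25))
interleaved = assign_uplinks(interleaved_order, 12)

print(distinct_uplinks(ordered, group))      # 12
print(distinct_uplinks(interleaved, group))  # 11
```

In the interleaved case node13 and node14 both land on the same uplink,
so traffic to node[13-24] has one fewer uplink to spread over, even
though every uplink still carries exactly two destinations overall.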

> I understand that this guid_routing_order_file is synchronized with
> an MPI rank file, right? If not, then synchronizing them might give
> even better results.

Not quite sure what you mean by an MPI rank file.  At LLNL, slurm is
responsible for MPI ranks, so I order the guids in my file according to
how slurm is configured for choosing MPI ranks.  I will admit to being a
novice at MPI configuration (I blindly accept the slurm MPI rankings).
Is there an underlying file that MPI libs use for ranking knowledge?

> Another idea: OpenSM can create such file (list, doesn't have to be
> actual file) automatically, just by checking topologically-adjacent
> leaf switches and their HCAs.

Definitely a good idea.  This patch set was just a "step one" kind of
thing.

> 
> > I list the port guids of the nodes of the cluster from node0 to nodeN, one
> > per line in the file.  By listing the nodes in this order, I believe we
> > could get less contention in the network.  In the example above, sending
> > to node[13-24] should use all of the 12 uplinks, b/c the ports will be
> > equally used b/c nodes[1-12] were routed beforehand in order.
> > 
> > The results from some tests are pretty impressive when I do this. LMC=0
> > average bandwidth in mpiGraph goes from 391.374 MB/s to 573.678 MB/s
> > when I use guid_routing_order.
> 
> Can you compare this to the fat-tree routing?  Conceptually, fat-tree
> is doing the same - it routes LIDs on nodes in a topological order, so
> it would be interesting to see the comparison.

Actually I already did :-).  w/ LMC=0.

updn default - 391.374 MB/s
updn w/ guid_routing_order - 573.678 MB/s
ftree - 579.603 MB/s

I later discovered that one of the internal ports of the cluster I'm
testing on was broken (sLB of a 288 port), and I think that is the cause
of some of the slowdown w/ updn w/ guid_routing_order.  So ftree (as
designed) seemed to be able to work around it properly, while updn (as
currently implemented) couldn't.

When we turn on LMC > 0, MPI libraries that are LMC > 0 aware were able
to do better than ftree on some tests.  One example (I think these
numbers are in microseconds; lower is better):

Alltoall 16K packets
ftree - 415490.6919
updn normal (LMC=0) - 495460.5526
updn w/ ordered routing (LMC=0) - 416562.7417
updn w/ ordered routing (LMC=1) - 453153.7289
 - this ^^^ result is quite odd.  Not sure why.
updn w/ ordered routing (LMC=2) - 3660132.1530

We are regularly debating what will be better overall at the end of the
day.

> Also, fat-tree produces the guid order file automatically, but nobody
> used it yet as an input to produce MPI rank file.

I didn't know about this option.  How do you do this (I just skimmed the
manpage and didn't see anything)?  I know about --cn_guid_file, but
since that file doesn't have to be ordered, I created a different option
(rather than using the cn_guid_file for both ftree and updn).
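
To spell out why ordering matters for this option, the semantics
described earlier in the thread (listed port guids are routed first, in
file order; guids not listed are routed last) can be sketched roughly
like this. It is illustrative Python, not the actual patch: the function
name, the hex guid format, and the handling of blank lines, '#' comments,
and duplicates are assumptions.

```python
def routing_order(fabric_guids, order_file_lines):
    """Return fabric_guids reordered: listed guids first, the rest last."""
    listed = []
    seen = set()
    for line in order_file_lines:
        text = line.strip()
        if not text or text.startswith("#"):
            continue  # assumed: blank lines and comments are ignored
        guid = int(text, 16)  # one hex port guid per line, "0x" prefix optional
        if guid not in seen:
            seen.add(guid)
            listed.append(guid)

    present = set(fabric_guids)
    # Listed guids that actually exist in the fabric, in file order
    first = [g for g in listed if g in present]
    # Everything not listed keeps its original (scan) order and goes last
    rest = [g for g in fabric_guids if g not in seen]
    return first + rest


# Made-up guids: the fabric scan finds them in the order 3, 1, 2, 4
fabric = [0x3, 0x1, 0x2, 0x4]
order_file = ["0x1\n", "0x2\n", "\n", "# node3 not listed\n", "0x9\n"]
print(routing_order(fabric, order_file))  # [1, 2, 3, 4]
```

An unordered file would give the same set of guids but no control over
which ones get routed first.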

Al

> -- Yevgeny
> 
> > A variety of other positive performance
> > increases were found when doing other tests, other MPIs, and other LMCs
> > if anyone is interested.
> > 
> > BTW, I developed this patch series before your preserve-base-lid patch
> > series.  It will 100% conflict with the preserve-base-lid patch series.
> > I will fix this patch series once the preserve-base-lids patch series is
> > committed to git.  I'm just looking for comments right now.
> > 
> > Al
> > 
> 
-- 
Albert Chu
chu11 at llnl.gov
925-422-5311
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory



