[ofa-general] [OPENSM PATCH 0/5]: New "guid-routing-order" option for updn routing

Al Chu chu11 at llnl.gov
Tue Jun 17 09:35:49 PDT 2008


Hey Yevgeny,

On Tue, 2008-06-17 at 13:59 +0300, Yevgeny Kliteynik wrote:
> Yevgeny Kliteynik wrote:
> > Hi Al,
> > 
> > Al Chu wrote:
> >> On Mon, 2008-06-16 at 10:21 -0700, Al Chu wrote:
> >>> Hey Yevgeny,
> >>>  
> >>> On Sun, 2008-06-15 at 11:17 +0300, Yevgeny Kliteynik wrote:
> >>>> Hi Al,
> >>>>
> >>>> Al Chu wrote:
> >>>>> Hey Sasha,
> >>>>>
> >>>>> This is a conceptually simple option I've developed for updn routing.
> >>>>>
> >>>>> Currently in updn routing, nodes/guids are routed on switches in a
> >>>>> seemingly-random order, which I believe is due to internal data
> >>>>> structure organization (i.e. cl_qmap_apply_func is called on
> >>>>> port_guid_tbl) as well as how the fabric is scanned (the scan is
> >>>>> logical from a port perspective, but not necessarily from a node
> >>>>> perspective).  My hypothesis was that this leads to increased
> >>>>> network contention for MPI.
> >>>>>
> >>>>> For example, suppose we have 12 uplinks from a leaf switch to a spine
> >>>>> switch.  If we want to send data from this leaf switch to node[13-24],
> >>>>> the uplinks we send on are pretty random (the toy model below
> >>>>> illustrates this).  That is because:
> >>>>>
> >>>>> A) node[13-24] are individually routed at seemingly-random points,
> >>>>> based on when cl_qmap_apply_func() happens to visit them.
> >>>>>
> >>>>> B) the port chosen for each route is the least-used one.
> >>>>>
> >>>>> C) port usage counts depend on whatever was routed earlier.
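> >>>>>
> >>>>> To make this concrete, here is a toy model (plain C, not OpenSM
> >>>>> code; the fabric sizes are made up): 96 destination LIDs are routed
> >>>>> across 12 uplinks, always picking the least-used port:
> >>>>>
> >>>>> #include <stdio.h>
> >>>>> #include <stdlib.h>
> >>>>>
> >>>>> #define UPLINKS 12
> >>>>> #define NODES   96
> >>>>>
> >>>>> /* route every node through the least-used uplink, in the given
> >>>>>  * order, then count how many distinct uplinks node[13-24] got */
> >>>>> static int distinct_uplinks(const int *order)
> >>>>> {
> >>>>>     int usage[UPLINKS] = {0}, port_of[NODES], seen[UPLINKS] = {0};
> >>>>>     int i, j, n = 0;
> >>>>>
> >>>>>     for (i = 0; i < NODES; i++) {
> >>>>>         int best = 0;
> >>>>>         for (j = 1; j < UPLINKS; j++)   /* least-used port wins */
> >>>>>             if (usage[j] < usage[best])
> >>>>>                 best = j;
> >>>>>         usage[best]++;
> >>>>>         port_of[order[i]] = best;
> >>>>>     }
> >>>>>     for (i = 12; i < 24; i++)           /* node[13-24], 0-based */
> >>>>>         seen[port_of[i]] = 1;
> >>>>>     for (j = 0; j < UPLINKS; j++)
> >>>>>         n += seen[j];
> >>>>>     return n;
> >>>>> }
> >>>>>
> >>>>> int main(void)
> >>>>> {
> >>>>>     int order[NODES], i;
> >>>>>
> >>>>>     for (i = 0; i < NODES; i++)         /* route in node order */
> >>>>>         order[i] = i;
> >>>>>     printf("in-order: %d uplinks\n", distinct_uplinks(order));
> >>>>>
> >>>>>     srand(1);
> >>>>>     for (i = NODES - 1; i > 0; i--) {   /* Fisher-Yates shuffle */
> >>>>>         int j = rand() % (i + 1), t = order[i];
> >>>>>         order[i] = order[j];
> >>>>>         order[j] = t;
> >>>>>     }
> >>>>>     printf("shuffled: %d uplinks\n", distinct_uplinks(order));
> >>>>>     return 0;
> >>>>> }
> >>>>>
> >>>>> Routing in node order, any block of 12 consecutive nodes lands on
> >>>>> all 12 uplinks; the shuffled order (standing in for the guid-keyed
> >>>>> traversal of cl_qmap_apply_func()) usually covers far fewer.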
> >>>>>
> >>>>> So I developed this patch series, which adds an option called
> >>>>> "guid_routing_order_file", letting the user supply a file listing
> >>>>> port_guids in the order in which they should be routed (naturally,
> >>>>> guids not listed are routed last).
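> >>>>>
> >>>>> With the patches applied, the option can be set in the usual OpenSM
> >>>>> options file, something like this (the path is just an example):
> >>>>>
> >>>>>   guid_routing_order_file /etc/opensm/guid_routing_order.conf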
> >>>> Great idea!
> >>> Thanks.
> >>>
> >>>> I understand that this guid_routing_order_file is synchronized with
> >>>> an MPI rank file, right? If not, then synchronizing them might give
> >>>> even better results.
> >>> Not quite sure what you mean by an MPI rank file.  At LLNL, slurm is
> >>> responsible for MPI ranks, so I order the guids in my file according to
> >>> how slurm is configured for choosing MPI ranks.  I will admit to being a
> >>> novice at MPI configuration (blindly accepting slurm MPI rankings).
> >>> Is there an underlying file that MPI libs use for ranking knowledge?
> >>
> >> I spoke to one of our MPI guys.  I wasn't aware that some MPIs let you
> >> input a file telling them how ranks should be assigned to nodes.  I
> >> assume that's what you're talking about?
> > 
> > Yes, that is what I was talking about.
> > There is a host file where you list all the hosts that MPI should use,
> > and in some MPIs there is also a way to specify the order of the MPI
> > ranks assigned to processes (I'm not an MPI expert, so I'm not sure
> > about the terminology).
> > I know that MVAPICH uses the host order when assigning ranks, so the
> > order of the cluster nodes listed in the host file is important.
> > Not sure about OpenMPI.
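> > For example, a host file is often just host names, one per line
> > (hypothetical names here), and MVAPICH would assign ranks in that order:
> >
> >   node1
> >   node2
> >   node3
> >   node4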
> > 
> >>>> Another idea: OpenSM could create such a file (or list - it doesn't
> >>>> have to be an actual file) automatically, just by checking
> >>>> topologically-adjacent leaf switches and their HCAs.
> >>> Definitely a good idea.  This patch set was just a "step one" kind of
> >>> thing.
> >>>
> >>>>> I list the port guids of the nodes of the cluster from node0 to
> >>>>> nodeN, one per line in the file.  By listing the nodes in this
> >>>>> order, I believe we can get less contention in the network.  In the
> >>>>> example above, sending to node[13-24] should use all 12 of the
> >>>>> uplinks, because the ports will be equally used, since nodes[1-12]
> >>>>> were routed beforehand in order.
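> >>>>>
> >>>>> For example, the first few lines of my file look something like
> >>>>> this (the guid values here are made up):
> >>>>>
> >>>>>   0x0002c90300001001
> >>>>>   0x0002c90300001005
> >>>>>   0x0002c90300001009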
> >>>>>
> >>>>> The results from some tests are pretty impressive when I do this.
> >>>>> With LMC=0, average bandwidth in mpiGraph goes from 391.374 MB/s to
> >>>>> 573.678 MB/s when I use guid_routing_order.
> >>>> Can you compare this to the fat-tree routing?  Conceptually, fat-tree
> >>>> does the same - it routes LIDs on nodes in a topological order - so
> >>>> it would be interesting to see the comparison.
> >>> Actually I already did :-).  w/ LMC=0.
> >>>
> >>> updn default - 391.374 MB/s
> >>> updn w/ guid_routing_order - 573.678 MB/s
> >>> ftree - 579.603 MB/s
> >>>
> >>> I later discovered that one of the internal ports of the cluster I'm
> >>> testing on was broken (an sLB of a 288-port switch), and I think that
> >>> is the cause of some of the slowdown w/ updn w/ guid_routing_order.
> >>> So ftree (as designed) seemed to be able to work around it properly,
> >>> while updn (as currently implemented) couldn't.
> >>>
> >>> When we turn on LMC > 0, MPI libraries that are LMC-aware were able
> >>> to do better than ftree on some tests.  One example (I think these
> >>> numbers are in microseconds; lower is better):
> >>>
> >>> Alltoall 16K packets
> >>> ftree - 415490.6919
> >>> updn normal (LMC=0) - 495460.5526
> >>> updn w/ ordered routing (LMC=0) - 416562.7417
> >>> updn w/ ordered routing (LMC=1) - 453153.7289
> >>>  - this ^^^ result is quite odd.  Not sure why.
> >>> updn w/ ordered routing (LMC=2) - 3660132.1530
> >>>
> >>> We regularly debate which will be better overall at the end of the
> >>> day.
> >>>
> >>>> Also, fat-tree produces the guid order file automatically, but nobody
> >>>> has used it yet as input to produce an MPI rank file.
> >>> I didn't know about this option.  How do you do this?  (I just skimmed
> >>> the manpage and didn't see anything.)
> > 
> > Right, it's missing there. I'll add this info.
> 
> Nope, it's there:
> 
>   "The algorithm also dumps compute node ordering file (opensm-ftree-ca-order.dump)
>    in the same directory where the OpenSM log resides. This ordering file provides
>    the CN order that may be used to create efficient communication pattern, that
>    will match the routing tables."

Thanks.  I guess I just missed it.  The manpage is getting big :-)

Al

> -- Yevgeny
> 
> 
> > The file is /var/log/opensm-ftree-ca-order.dump.
> > Small correction though - the file contains an ordered list of HCA LIDs
> > and their host names.  It's not a problem to change it to have guids as
> > well, but MPI doesn't need guids anyway.
> > Note that the optimal order might be different depending on the current
> > topology state and the location of the management node that runs OpenSM.
> > 
> >>> I know about the --cn_guid_file.  But since that file doesn't have to
> >>> be ordered, I created a different option (rather than have the
> >>> cn_guid_file serve both ftree and updn).
> > 
> > Right, the cn file doesn't have to be ordered - ftree will order it
> > by itself. The ordering is by topology-adjacent leaf switches.
> > 
> > -- Yevgeny
> > 
> >>>
> >>> Al
> >>>
> >>>> -- Yevgeny
> >>>>
> >>>>> A variety of other performance increases were found in other tests,
> >>>>> with other MPIs, and at other LMCs, if anyone is interested.
> >>>>>
> >>>>> BTW, I developed this patch series before your preserve-base-lid
> >>>>> patch series, so it will 100% conflict with it.  I will fix this
> >>>>> series up once the preserve-base-lid patches are committed to git.
> >>>>> I'm just looking for comments right now.
> >>>>>
> >>>>> Al
> >>>>>
> > 
> 
-- 
Albert Chu
chu11 at llnl.gov
925-422-5311
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory

