[ofa-general] [PATCH 0/4] opensm: Unicast Routing Cache

Al Chu chu11 at llnl.gov
Mon May 5 09:32:49 PDT 2008


Hey Yevgeny,

This looks like a great idea.  But is there a reason it's only supported
for LMC=0?  Since the caching is handled at the ucast-mgr level (rather
than in the routing algorithm code), I don't quite see why LMC=0
matters.

Maybe it is because of the future incremental routing on your to-do list?
If that's the case, instead of only caching when LMC=0, perhaps the
initial incremental routing should only work under LMC=0.  Later on,
incremental routing for LMC > 0 could be added.

Al

On Sun, 2008-05-04 at 13:08 +0300, Yevgeny Kliteynik wrote:
> One thing I need to add here: ucast cache is currently supported
> for LMC=0 only.
> 
> -- Yevgeny
> 
> Yevgeny Kliteynik wrote:
> > Hi Sasha,
> > 
> > The following series of 4 patches implements a unicast routing cache
> > in OpenSM.
> > 
> > None of the current routing engines is scalable when we're talking
> > about big clusters. On a ~5K-node cluster with ~1.3K switches, it takes
> > about two minutes to calculate the routing. The problem is that each
> > time the routing is calculated from scratch.
> > 
> > Incremental routing (which is on my to-do list) aims to address this
> > problem when there is some "local" change in the fabric (e.g. a single
> > switch failure, a single link failure, a link added, etc.).
> > In such cases we can use the routing that was already calculated in
> > the previous heavy sweep, and then we just have to modify it according
> > to the change.
> > 
> > For instance, if some switch has disappeared from the fabric, we can
> > start from the routing that existed while this switch was present,
> > step back to its neighbors, and see whether it is possible to route
> > all the LIDs that were routed through this switch some other way
> > (which is usually the case).
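> > 
> > Roughly, that LFT-patching step could look like the following sketch
> > (illustrative C only, not OpenSM code; the function and parameter names
> > are made up).  For one affected switch, every cached LFT entry that
> > pointed at the failed neighbor is redirected to an alternative port:
> > 
> > #include <stddef.h>
> > #include <stdint.h>
> > 
> > /* Hypothetical sketch: patch one switch's cached LFT so that LIDs
> >  * previously forwarded out of 'port_to_failed' use 'alt_port'
> >  * instead.  Returns the number of entries patched. */
> > static size_t reroute_lft_around_failed(uint8_t *lft, uint16_t max_lid,
> >                                         uint8_t port_to_failed,
> >                                         uint8_t alt_port)
> > {
> >         size_t patched = 0;
> > 
> >         for (uint32_t lid = 1; lid <= max_lid; lid++) {
> >                 if (lft[lid] != port_to_failed)
> >                         continue;
> >                 /* In the real incremental routing the alternative port
> >                  * would be chosen per LID from the cached lid matrices
> >                  * (minimal hop count); a single precomputed detour
> >                  * port stands in for that here. */
> >                 lft[lid] = alt_port;
> >                 patched++;
> >         }
> >         return patched;
> > }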
> > 
> > To implement incremental routing, we need to create some kind of unicast
> > routing cache, which is what these patches implement. In addition to being
> > a step toward incremental routing, the cache is useful by itself.
> > 
> > This cache can save us a routing calculation when the change is in the
> > leaf switches or in the hosts. For instance, if some node is rebooted,
> > OpenSM would start a heavy sweep with a full routing recalculation when
> > the HCA goes down, and another one when the HCA comes back up, when in
> > fact both of these routing calculations can be replaced by using the
> > unicast routing cache.
> > 
> > The unicast routing cache comprises the following (a simplified sketch
> > follows this list):
> >  - Topology: a data structure with all the switches and CAs of the fabric
> >  - LFTs: each switch has its LFT cached
> >  - Lid matrices: each switch has its lid matrices cached; these are
> >    needed for multicast routing (which itself is not cached)
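> > 
> > As a rough illustration of what gets cached, here is a simplified
> > sketch (the types and field names are hypothetical, not the actual
> > OpenSM structures introduced by the patches):
> > 
> > #include <stddef.h>
> > #include <stdint.h>
> > 
> > /* Hypothetical, simplified model of the cached data. */
> > typedef struct cache_switch {
> >         uint64_t node_guid;      /* switch identity */
> >         uint8_t  num_ports;
> >         uint64_t *remote_guids;  /* per port: neighbor GUID, 0 if no link */
> >         uint8_t  *lft;           /* cached LFT, indexed by LID */
> >         uint8_t  *lid_matrix;    /* cached lid matrices (for mcast routing) */
> >         uint16_t max_lid;
> > } cache_switch_t;
> > 
> > typedef struct cache_topology {
> >         unsigned num_switches;
> >         cache_switch_t *switches; /* all switches of the cached fabric */
> >         unsigned num_cas;
> >         uint64_t *ca_guids;       /* GUIDs of all cached CAs */
> >         int valid;                /* nonzero if the cache may be used */
> > } cache_topology_t;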
> > 
> > There is a topology matching function that compares the current topology
> > with the cached one to find out whether the cache is usable (valid) or not.
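> > 
> > In terms of the simplified types above, the check boils down to
> > something like this (again a hypothetical sketch, not the actual
> > matching function in the patches): the cache stays usable only if
> > nothing new appeared and no existing link moved, while elements that
> > merely disappeared are tolerated.
> > 
> > static const cache_switch_t *
> > find_cached_switch(const cache_topology_t *cached, uint64_t guid)
> > {
> >         for (unsigned i = 0; i < cached->num_switches; i++)
> >                 if (cached->switches[i].node_guid == guid)
> >                         return &cached->switches[i];
> >         return NULL;
> > }
> > 
> > static int cache_topology_usable(const cache_topology_t *cached,
> >                                  const cache_topology_t *current)
> > {
> >         for (unsigned i = 0; i < current->num_switches; i++) {
> >                 const cache_switch_t *cur = &current->switches[i];
> >                 const cache_switch_t *old =
> >                     find_cached_switch(cached, cur->node_guid);
> > 
> >                 /* A switch that is new, or wired differently than in
> >                  * the cache, invalidates the cache. */
> >                 if (!old || old->num_ports != cur->num_ports)
> >                         return 0;
> >                 for (uint8_t p = 0; p < cur->num_ports; p++)
> >                         if (cur->remote_guids[p] &&
> >                             cur->remote_guids[p] != old->remote_guids[p])
> >                                 return 0;
> >         }
> >         /* Elements that exist only in the cache (i.e. disappeared from
> >          * the fabric) are tolerated; the real check would additionally
> >          * verify that a disappeared switch is a leaf switch, and would
> >          * compare the CA lists in the same way. */
> >         return 1;
> > }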
> > 
> > The cache is used in the following way (see the sketch after this list):
> >  - the SM starts up and performs the first routing calculation
> >  - the calculated routing is stored in the cache
> >  - at some point a new heavy sweep is triggered
> >  - the unicast manager checks whether the cache can be used instead
> >    of a new routing calculation.
> >    Cached routing can be used in any of the following cases:
> >     + there is no topology change
> >     + one or more CAs disappeared (they exist in the cached topology
> >       model, but are missing in the newly discovered fabric)
> >     + one or more leaf switches disappeared
> >    In these cases the cached routing is written to the switches as is
> >    (skipping any switch that no longer exists).
> >    If there is any other topology change:
> >      - the existing cache is invalidated
> >      - the new topology is cached
> >      - the routing is calculated as usual
> >      - the new routing is cached
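> > 
> > Expressed on top of the sketches above (hypothetical names again, not
> > the actual ucast manager code), the decision at the start of the
> > routing phase is essentially:
> > 
> > typedef enum {
> >         ROUTE_FROM_CACHE,   /* write cached LFTs as is, skipping
> >                              * switches that no longer exist */
> >         ROUTE_FROM_SCRATCH  /* cache the topology, run the routing
> >                              * engine as usual, cache the new LFTs */
> > } routing_decision_t;
> > 
> > static routing_decision_t
> > choose_routing_path(cache_topology_t *cache,
> >                     const cache_topology_t *discovered)
> > {
> >         /* Cached routing is reusable only when nothing appeared or
> >          * moved: either the topology is unchanged, or the only
> >          * difference is CAs and/or leaf switches that disappeared. */
> >         if (cache->valid && cache_topology_usable(cache, discovered))
> >                 return ROUTE_FROM_CACHE;
> > 
> >         /* Any other change: invalidate, then recalculate and re-cache. */
> >         cache->valid = 0;
> >         return ROUTE_FROM_SCRATCH;
> > }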
> > 
> > My simulations show that while the usual routing phase of a heavy
> > sweep on the topology mentioned above takes ~2 minutes, cached routing
> > reduces this time to 6 seconds (which is nice, if you ask me...).
> > 
> > Of all the cases where the cache is valid, the most painful and
> > "complainable" one is a compute node reboot (which happens pretty
> > often) causing two heavy sweeps with two full routing calculations.
> > The unicast routing cache aims to solve this problem (again, in
> > addition to being a step toward incremental routing).
> > 
> > -- Yevgeny
-- 
Albert Chu
chu11 at llnl.gov
925-422-5311
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory



