[ofa-general] [PATCH 0/4] opensm: Unicast Routing Cache
Yevgeny Kliteynik
kliteyn at dev.mellanox.co.il
Sun May 4 03:08:51 PDT 2008
One thing I need to add here: the ucast cache is currently supported
for LMC=0 only.
-- Yevgeny
Yevgeny Kliteynik wrote:
> Hi Sasha,
>
> The following series of 4 patches implements a unicast routing cache
> in OpenSM.
>
> None of the current routing engines scales well to big clusters.
> On a ~5K-node cluster with ~1.3K switches, it takes about two minutes
> to calculate the routing. The problem is that each time the routing
> is calculated from scratch.
>
> Incremental routing (which is on my to-do list) aims to address this
> problem when there is some "local" change in the fabric (e.g. a single
> switch failure, a single link failure, a link added, etc.).
> In such cases we can reuse the routing that was calculated in the
> previous heavy sweep and just modify it according to the change.
>
> For instance, if some switch has disappeared from the fabric, we can
> take the routing that existed with this switch, step back from this
> switch, and check whether all the lids that were routed through it can
> be routed some other way (which is usually the case).
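>
> Just to illustrate the idea, here is a rough sketch of what the
> "step back" could look like for a single lid (this is made-up code,
> not from the patches; all names are hypothetical). 'lft' is the
> neighbor switch's forwarding table and 'hops' is its cached hop count
> to the lid through each port, taken from the lid matrices:
>
>     #include <stdint.h>
>
>     /* Try to route 'lid' around a failed switch: pick the surviving
>      * port with the fewest hops to the lid.  Returns 0 on success. */
>     static int reroute_lid(uint8_t *lft, uint16_t lid,
>                            const uint8_t *hops, uint8_t num_ports,
>                            uint8_t dead_port)
>     {
>             uint8_t p, best_port = 0, best_hops = 0xff;
>
>             for (p = 1; p <= num_ports; p++) {
>                     if (p == dead_port || hops[p] == 0xff)
>                             continue;       /* down or unreachable */
>                     if (hops[p] < best_hops) {
>                             best_hops = hops[p];
>                             best_port = p;
>                     }
>             }
>
>             if (!best_port)
>                     return -1;      /* no detour - full rerouting */
>
>             lft[lid] = best_port;   /* detour around the dead switch */
>             return 0;
>     }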
>
> To implement incremental routing, we need to create some kind of
> unicast routing cache, which is what these patches implement. In
> addition to being a step toward incremental routing, the routing cache
> is useful by itself.
>
> This cache can save us a routing calculation when the change is in the
> leaf switches or in the hosts. For instance, if some node is rebooted,
> OpenSM would start one heavy sweep with a full routing recalculation
> when the HCA goes down, and another one when the HCA comes back up,
> when in fact both of these routing calculations can be replaced by
> using the unicast routing cache.
>
> The unicast routing cache comprises the following (a rough sketch of
> such a structure follows below):
> - Topology: a data structure with all the switches and CAs of the fabric
> - LFTs: each switch has its LFT cached
> - Lid matrices: each switch has its lid matrices cached, which are
>   needed for multicast routing (multicast routing itself is not cached).
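>
> To make this concrete, here is a rough sketch of what the cache could
> hold (hypothetical types and field names, for illustration only; these
> are not the actual symbols from the patches):
>
>     #include <stdint.h>
>     #include <stdlib.h>
>
>     /* One cached switch: identity plus its routing tables. */
>     typedef struct cache_switch {
>             uint64_t node_guid;     /* switch node GUID */
>             uint8_t  num_ports;
>             uint16_t max_lid;       /* valid entries in the tables */
>             uint8_t *lft;           /* lft[lid] = exit port */
>             uint8_t *hops;          /* lid matrix:
>                                        hops[lid * (num_ports + 1) + port] */
>     } cache_switch_t;
>
>     /* One cached CA port and where it is attached. */
>     typedef struct cache_ca {
>             uint64_t port_guid;
>             uint64_t remote_sw_guid; /* switch the CA hangs off */
>             uint8_t  remote_port;    /* port on that switch */
>     } cache_ca_t;
>
>     /* The whole cache: a topology snapshot + per-switch routing. */
>     typedef struct ucast_cache {
>             cache_switch_t *switches;
>             size_t          num_switches;
>             cache_ca_t     *cas;
>             size_t          num_cas;
>             int             valid;  /* cleared when invalidated */
>     } ucast_cache_t;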
>
> There is a topology matching function that compares the current topology
> with the cached one to find out whether the cache is usable (valid) or not.
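>
> A matching check along these lines could look roughly as follows
> (reusing the hypothetical types from the sketch above): the cache is
> usable when the discovered fabric is the cached fabric minus,
> possibly, some CAs and leaf switches; anything in the discovered
> fabric that the cache doesn't know about invalidates it.
>
>     static const cache_switch_t *
>     cache_find_switch(const ucast_cache_t *c, uint64_t guid)
>     {
>             size_t i;
>
>             for (i = 0; i < c->num_switches; i++)
>                     if (c->switches[i].node_guid == guid)
>                             return &c->switches[i];
>             return NULL;
>     }
>
>     static const cache_ca_t *
>     cache_find_ca(const ucast_cache_t *c, uint64_t guid)
>     {
>             size_t i;
>
>             for (i = 0; i < c->num_cas; i++)
>                     if (c->cas[i].port_guid == guid)
>                             return &c->cas[i];
>             return NULL;
>     }
>
>     static int cache_matches(const ucast_cache_t *cache,
>                              const ucast_cache_t *discovered)
>     {
>             size_t i;
>
>             /* every discovered switch must already be in the cache */
>             for (i = 0; i < discovered->num_switches; i++)
>                     if (!cache_find_switch(cache,
>                                            discovered->switches[i].node_guid))
>                             return 0;
>
>             /* every discovered CA must be attached where the cache
>              * remembers it */
>             for (i = 0; i < discovered->num_cas; i++) {
>                     const cache_ca_t *ca =
>                             cache_find_ca(cache, discovered->cas[i].port_guid);
>
>                     if (!ca ||
>                         ca->remote_sw_guid != discovered->cas[i].remote_sw_guid ||
>                         ca->remote_port != discovered->cas[i].remote_port)
>                             return 0;
>             }
>
>             /* a cached switch missing from the discovered fabric is
>              * acceptable only if it was a leaf; that check needs the
>              * cached link information and is left out of this sketch */
>             return 1;
>     }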
>
> The cache is used the following way (sketched in code below):
> - SM is executed - it runs the first routing calculation
> - the calculated routing is stored in the cache
> - at some point a new heavy sweep is triggered
> - the unicast manager checks whether the cache can be used instead
>   of a new routing calculation.
>   The cached routing can be used in any of the following cases:
>   + there is no topology change
>   + one or more CAs disappeared (they exist in the cached topology
>     model, but are missing in the newly discovered fabric)
>   + one or more leaf switches disappeared
>   In these cases the cached routing is written to the switches as is
>   (skipping any switch that no longer exists).
>   If there is any other topology change:
>   - the existing cache is invalidated
>   - the new topology is cached
>   - routing is calculated as usual
>   - the calculated routing is cached
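>
> In code, that decision flow might look roughly like this (continuing
> the hypothetical sketches above; write_cached_lfts(), cache_topology(),
> calculate_routing() and cache_routing() are just placeholders for the
> steps listed, not real OpenSM functions):
>
>     extern int  cache_matches(const ucast_cache_t *cache,
>                               const ucast_cache_t *discovered);
>     extern void write_cached_lfts(const ucast_cache_t *cache,
>                                   const ucast_cache_t *discovered);
>     extern void cache_topology(ucast_cache_t *cache,
>                                const ucast_cache_t *discovered);
>     extern void calculate_routing(void);
>     extern void cache_routing(ucast_cache_t *cache);
>
>     static void ucast_mgr_do_routing(ucast_cache_t *cache,
>                                      const ucast_cache_t *discovered)
>     {
>             if (cache->valid && cache_matches(cache, discovered)) {
>                     /* write the cached LFTs to the switches as is,
>                      * skipping switches that no longer exist */
>                     write_cached_lfts(cache, discovered);
>                     return;
>             }
>
>             cache->valid = 0;                  /* invalidate stale cache */
>             cache_topology(cache, discovered); /* snapshot new topology */
>             calculate_routing();               /* usual full calculation */
>             cache_routing(cache);              /* store the new routing */
>             cache->valid = 1;
>     }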
>
> My simulations show that where the usual routing phase of the heavy
> sweep on the topology mentioned above takes ~2 minutes, the cached
> routing reduces this time to 6 seconds (which is nice, if you
> ask me...).
>
> Of all the cases where the cache is valid, the most painful and most
> complained-about one is a compute node reboot (which happens pretty
> often) causing two heavy sweeps with two full routing calculations.
> The unicast routing cache is aimed at solving this problem (again, in
> addition to being a step toward incremental routing).
>
> -- Yevgeny