[ofa-general] [PATCH 0/4] opensm: Unicast Routing Cache

Yevgeny Kliteynik kliteyn at dev.mellanox.co.il
Sun May 4 03:08:51 PDT 2008


One thing I need to add here: ucast cache is currently supported
for LMC=0 only.

-- Yevgeny

Yevgeny Kliteynik wrote:
> Hi Sasha,
> 
> The following series of 4 patches implements unicast routing cache
> in OpenSM.
> 
> None of the current routing engines scales well to big clusters.
> On a ~5K-node cluster with ~1.3K switches, it takes about two
> minutes to calculate the routing. The problem is that each time
> the routing is calculated from scratch.
> 
> Incremental routing (which is on my to-do list) aims to address this
> problem when there is some "local" change in fabric (e.g. single
> switch failure, single link failure, link added, etc).
> In such cases we can use the routing that was already calculated in
> the previous heavy sweep, and then we just have to modify it according
> to the change.
> 
> For instance, if some switch has disappeared from the fabric, we can
> use the routing that existed with this switch, take a step back from
> this switch and see if it is possible to route all the lids that were
> routed through this switch some other way (which is usually the case).
> 
> To implement incremental routing, we need to create some kind of unicast
> routing cache, which is what these patches implement. In addition to being
> a step toward incremental routing, the routing cache is useful by itself.
> 
> This cache can save us a routing calculation when the change is in the
> leaf switches or in the hosts. For instance, if some node is rebooted,
> OpenSM would start a heavy sweep with a full routing recalculation when
> the HCA goes down, and another one when the HCA comes back up, when in
> fact both of these routing calculations can be replaced by using the
> unicast routing cache.
> 
> Unicast routing cache comprises the following:
>  - Topology: a data structure with all the switches and CAs of the fabric
>  - LFTs: each switch has an LFT cached
>  - Lid matrices: each switch has its lid matrices cached; these are
>    needed for multicast routing (which is not cached).
> 
> There is a topology matching function that compares the current topology
> with the cached one to find out whether the cache is usable (valid) or not.
> 
> The cache is used the following way:
>  - SM is executed - it starts the first routing calculation
>  - the calculated routing is stored in the cache
>  - at some point a new heavy sweep is triggered
>  - the unicast manager checks whether the cache can be used instead
>    of a new routing calculation.
>    Cached routing can be used in any of the following cases:
>     + there is no topology change
>     + one or more CAs disappeared (they exist in the cached topology
>       model, but are missing in the newly discovered fabric)
>     + one or more leaf switches disappeared
>    In these cases the cached routing is written to the switches as is
>    (skipping any switch that no longer exists).
>    If there is any other topology change:
>      - the existing cache is invalidated
>      - the topology is cached
>      - the routing is calculated as usual
>      - the routing is cached
> 
> My simulations show that where the usual routing phase of the heavy
> sweep on the topology mentioned above takes ~2 minutes, cached
> routing reduces this time to ~6 seconds (which is nice, if you
> ask me...).
> 
> Of all the cases when the cache is valid, the most painful and
> "complainable" one is when a compute node reboot (which happens pretty
> often) causes two heavy sweeps with two full routing calculations.
> The Unicast Routing Cache aims to solve this problem (again, in addition
> to being a step toward incremental routing).
> 
> -- Yevgeny



