[ofa-general] Re: [PATCH 0/4] opensm: Unicast Routing Cache

Sasha Khapyorsky sashak at voltaire.com
Sun Jun 29 14:33:06 PDT 2008


Hi Yevgeny,

On 12:57 Sun 04 May, Yevgeny Kliteynik wrote:
> 
> The following series of 4 patches implements a unicast routing cache
> in OpenSM.
>
> None of the current routing engines is scalable when we're talking
> about big clusters. On a ~5K cluster with ~1.3K switches, it takes
> about two minutes to calculate the routing. The problem is that each
> time the routing is calculated from scratch.

I really like the idea of not rebuilding the entire routing from scratch.
However, I'm very unhappy with this implementation - it is huge,
complicated, and tries to copy everything that OpenSM already has -
topology, LFTs, lid matrices, etc.

> Incremental routing (which is on my to-do list) aims to address this
> problem when there is some "local" change in the fabric (e.g. a single
> switch failure, a single link failure, a link added, etc.).
> In such cases we can use the routing that was already calculated in
> the previous heavy sweep, and then we just have to modify it according
> to the change.
> 
> For instance, if some switch has disappeared from the fabric, we can
> use the routing that existed with this switch, take a step back from
> this switch and see if it is possible to route all the lids that were
> routed through this switch some other way (which is usually the case).
> 
> To implement incremental routing, we need to create some kind of unicast
> routing cache, which is what these patches implement.

The osm_switch struct already has LFT images (if you would like to convert
its cl_vector implementation to a single array like ucast_mgr has, I would
appreciate that :)). I don't see why it cannot be used.
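
Just to illustrate what I mean - a rough sketch only, the struct and
field names below are made up and not existing OpenSM members:

    #include <stdint.h>
    #include <complib/cl_qmap.h>

    /* keep the whole LFT as one flat array on the switch, the way
     * ucast_mgr keeps its lft_buf, instead of a cl_vector of blocks */
    typedef struct sketch_switch {
        cl_map_item_t map_item;
        /* ... other existing osm_switch_t members ... */
        uint8_t *lft;       /* lft[lid] = out port for that LID */
        uint16_t lft_size;  /* max_lid_ho + 1, rounded up to a block */
    } sketch_switch_t;

    static inline uint8_t
    sketch_switch_get_port_by_lid(const sketch_switch_t * p_sw,
                                  uint16_t lid_ho)
    {
        if (lid_ho >= p_sw->lft_size)
            return 0xFF;    /* OSM_NO_PATH */
        return p_sw->lft[lid_ho];
    }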

If later, when the switch comes back, we want to restore it as is, it
would require quite a small modification in the drop manager - instead of
removing and freeing the switch object instance, move it to some
old_switches qmap.
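
For illustration only - old_switches here would be a new cl_qmap_t added
to osm_subn_t, and the function name is made up:

    #include <complib/cl_qmap.h>
    #include <opensm/osm_subnet.h>
    #include <opensm/osm_switch.h>

    /* instead of osm_switch_delete(), park the switch object (with its
     * LFT and lid matrices) keyed by node GUID, so it can be restored
     * as is if the switch comes back */
    static void drop_mgr_park_switch(osm_subn_t * p_subn, osm_switch_t * p_sw)
    {
        cl_qmap_remove_item(&p_subn->sw_guid_tbl, &p_sw->map_item);
        /* old_switches: assumed new cl_qmap_t member of osm_subn_t */
        cl_qmap_insert(&p_subn->old_switches,
                       osm_node_get_node_guid(p_sw->p_node),
                       &p_sw->map_item);
    }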

> In addition to being a step toward incremental routing, the routing
> cache is useful by itself.

I fully agree with this (assuming that the routing cache is something that
saves us the full routing table calculation).

> This cache can save us a routing calculation in case of a change in the
> leaf switches or in hosts. For instance, if some node is rebooted, OpenSM
> would start a heavy sweep with a full routing recalculation when the HCA
> goes down, and another one when the HCA comes back up, when in fact both
> of these routing calculations can be replaced by using the unicast
> routing cache.
> 
> Unicast routing cache comprises the following:
>  - Topology: a data structure with all the switches and CAs of the fabric
>  - LFTs: each switch has an LFT cached
>  - Lid matrices: each switch has lid matrices cached, which is needed for
>    multicast routing (which is not cached).

Again, we already have all of this in OpenSM's tables. What is the reason
to copy it again?

> There is a topology matching function that compares the current topology
> with the cached one to find out whether the cache is usable (valid) or not.
> 
> The cache is used the following way:
>  - SM is executed - it starts first routing calculation
>  - calculated routing is stored in the cache
>  - at some point new heavy sweep is triggered
>  - unicast manager checks whether the cache can be used instead
>    of new routing calculation.
>    In one of the following cases we can use cached routing
>     + there is no topology change
>     + one or more CAs disappeared (they exist in the cached topology
>       model, but are missing in the newly discovered fabric)
>     + one or more leaf switches disappeared
>    In these cases cached routing is written to the switches as is
>    (unless the switch doesn't exist).
>    If there is any other topology change:
>      - existing cache is invalidated
>      - topology is cached
>      - routing is calculated as usual
>      - routing is cached
> 
> My simulations show that while the usual routing phase of the heavy
> sweep on the topology mentioned above takes ~2 minutes, cached routing
> reduces this time to 6 seconds (which is nice, if you ask me...).

It is nice. But I think that a much simpler and even faster
implementation would be something like this (rough sketch after the list):
- validate topology changes during the subnet discovery phase (just by
  updating some p_subn->change_status mask or ->need_rerouting flag),
- check this mask/flag at the beginning of osm_ucast_mgr_process() and,
  if rerouting is not needed, just return.
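
Roughly like this (just a sketch - the change_status bits and the helper
below don't exist in OpenSM today, they are what would need to be added):

    /* bits the discovery code would set in an assumed
     * p_subn->change_status mask */
    #define CHANGE_SWITCH   0x01    /* switch added/removed */
    #define CHANGE_LINK     0x02    /* inter-switch link came/went */
    #define CHANGE_CA       0x04    /* only CA/leaf ports changed */

    static int rerouting_needed(unsigned change_status)
    {
        /* switch or inter-switch link changes invalidate the LFTs
         * already programmed on the switches; CA-only changes can
         * keep them, as in the cached-routing cases above */
        return (change_status & (CHANGE_SWITCH | CHANGE_LINK)) != 0;
    }

    /* and at the beginning of osm_ucast_mgr_process():
     *
     *     if (!rerouting_needed(p_mgr->p_subn->change_status))
     *             return OSM_SIGNAL_DONE;
     */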

Another thing we may want to add (though it is not related to the routing
cache or incremental routing) is to keep two sets of LFTs with the switch
object for validation purposes - "requested" (filled by the routing
algorithm) and "real" (rebuilt from the response MADs). Roughly like the
sketch below.

In any case, I went over the submitted code anyway and will send the
comments.

Sasha


