[ofa-general] [PATCH 0/4] opensm: Unicast Routing Cache
Yevgeny Kliteynik
kliteyn at dev.mellanox.co.il
Sun May 4 02:57:30 PDT 2008
Hi Sasha,
The following series of 4 patches implements unicast routing cache
in OpenSM.
None of the current routing engines scales well to big clusters. On a
~5K-node cluster with ~1.3K switches, it takes about two minutes to
calculate the routing. The problem is that each time the routing is
calculated from scratch.
Incremental routing (which is on my to-do list) aims to address this
problem when there is some "local" change in the fabric (e.g. a single
switch failure, a single link failure, a link added, etc.).
In such cases we can use the routing that was already calculated in
the previous heavy sweep, and then we just have to modify it according
to the change.
For instance, if some switch has disappeared from the fabric, we can
use the routing that existed with this switch, take a step back from
this switch and see if it is possible to route all the lids that were
routed through this switch some other way (which is usually the case).
To implement incremental routing, we need to create some kind of unicast
routing cache, which is what these patches implement. In addition to being
a step toward incremental routing, the routing cache is useful by itself.
This cache can save us a routing calculation when the change is in the leaf
switches or in the hosts. For instance, if some node is rebooted, OpenSM
starts a heavy sweep with a full routing recalculation when the HCA goes
down, and another one when the HCA comes back up, when in fact both of these
routing calculations can be replaced by using the unicast routing cache.
Unicast routing cache comprises the following:
- Topology: a data structure with all the switches and CAs of the fabric
- LFTs: each switch has an LFT cached
- Lid matrices: each switch has its lid matrices cached, which are needed
  for multicast routing (multicast routing itself is not cached).
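To make the cache contents concrete, here is a rough C sketch of what the
cached data could look like. The type and field names (cache_switch_t,
cache_topology_t, etc.) are illustrative only and don't necessarily match
the actual patch code:

#include <stdint.h>
#include <stddef.h>

typedef struct cache_switch {
	uint64_t  guid;        /* switch node GUID */
	uint16_t  max_lid_ho;  /* highest LID covered by the cached tables */
	uint8_t   num_ports;
	uint8_t  *lft;         /* cached LFT: lft[lid] = output port */
	uint8_t **hops;        /* cached lid matrix: hops[lid][port] = hops */
} cache_switch_t;

typedef struct cache_topology {
	size_t           num_switches;
	cache_switch_t  *switches;  /* all switches of the cached fabric */
	size_t           num_cas;
	uint64_t        *ca_guids;  /* GUIDs of all cached CAs */
} cache_topology_t;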
There is a topology matching function that compares the current topology
with the cached one to find out whether the cache is usable (valid) or not.
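As an illustration of the matching rule (again, not the actual patch code),
a check over two snapshots in the cache_topology_t form above could look
roughly like the following; the cache stays valid only if nothing new
appeared and the only things that disappeared are CAs or leaf switches.
A real check would also have to compare inter-switch connectivity, which
is omitted here for brevity, and is_leaf_sw() is an assumed helper:

#include <stdbool.h>

static bool topo_has_switch(const cache_topology_t *t, uint64_t guid)
{
	for (size_t i = 0; i < t->num_switches; i++)
		if (t->switches[i].guid == guid)
			return true;
	return false;
}

static bool topo_has_ca(const cache_topology_t *t, uint64_t guid)
{
	for (size_t i = 0; i < t->num_cas; i++)
		if (t->ca_guids[i] == guid)
			return true;
	return false;
}

bool cache_is_valid(const cache_topology_t *cached,
		    const cache_topology_t *discovered,
		    bool (*is_leaf_sw)(const cache_switch_t *))
{
	/* nothing new may appear: every discovered switch/CA must be cached */
	for (size_t i = 0; i < discovered->num_switches; i++)
		if (!topo_has_switch(cached, discovered->switches[i].guid))
			return false;
	for (size_t i = 0; i < discovered->num_cas; i++)
		if (!topo_has_ca(cached, discovered->ca_guids[i]))
			return false;

	/* anything that disappeared must be a CA (always tolerated)
	 * or a leaf switch */
	for (size_t i = 0; i < cached->num_switches; i++) {
		const cache_switch_t *sw = &cached->switches[i];
		if (!topo_has_switch(discovered, sw->guid) && !is_leaf_sw(sw))
			return false;
	}
	return true;
}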
The cache is used in the following way:
- SM is executed - it performs the first routing calculation
- the calculated routing is stored in the cache
- at some point a new heavy sweep is triggered
- the unicast manager checks whether the cache can be used instead
  of a new routing calculation.
The cached routing can be used in any of the following cases:
+ there is no topology change
+ one or more CAs disappeared (they exist in the cached topology
  model, but are missing in the newly discovered fabric)
+ one or more leaf switches disappeared
In these cases the cached routing is written to the switches as is
(skipping switches that no longer exist).
If there is any other topology change:
- existing cache is invalidated
- topology is cached
- routing is calculated as usual
- routing is cached
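Roughly, the unicast manager's decision during the heavy sweep would then
look like the sketch below, reusing cache_is_valid() from the sketch above.
The function names are illustrative and are not the actual OpenSM or patch
functions:

/* assumed helpers, declared only to keep the sketch self-contained */
void write_cached_routing_to_switches(const cache_topology_t *cache,
				      const cache_topology_t *discovered);
void ucast_cache_invalidate(cache_topology_t **p_cache);
cache_topology_t *ucast_cache_build(const cache_topology_t *discovered);
void do_routing_calculation(const cache_topology_t *discovered);

void ucast_mgr_route(cache_topology_t **p_cache,
		     const cache_topology_t *discovered,
		     bool (*is_leaf_sw)(const cache_switch_t *))
{
	if (*p_cache && cache_is_valid(*p_cache, discovered, is_leaf_sw)) {
		/* cache hit: push the cached LFTs to the switches that
		 * still exist and skip the full routing calculation */
		write_cached_routing_to_switches(*p_cache, discovered);
		return;
	}

	/* any other topology change: drop the cache, cache the new
	 * topology, run the usual routing calculation and cache it */
	ucast_cache_invalidate(p_cache);
	do_routing_calculation(discovered);
	*p_cache = ucast_cache_build(discovered);
}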
My simulations show that where the usual routing phase of the heavy
sweep on the topology mentioned above takes ~2 minutes, cached routing
reduces this time to 6 seconds (which is nice, if you ask me...).
Of all the cases when the cache is valid, the most painful and
"complainable" one is a compute node reboot (which happens pretty
often) causing two heavy sweeps with two full routing calculations.
The unicast routing cache aims to solve this problem (again, in
addition to being a step toward incremental routing).
-- Yevgeny