[ofa-general] [PATCH 0/4] opensm: Unicast Routing Cache

Yevgeny Kliteynik kliteyn at dev.mellanox.co.il
Mon May 5 13:52:28 PDT 2008


Al Chu wrote:
> Hey Yevgeny,
> 
> This looks like a great idea.  But is there a reason it's only supported
> for LMC=0?  Since the caching is handled at the ucast-mgr level (rather
> than in the routing algorithm code), I don't quite see why LMC=0
> matters.

No particular reason - I'll enhance it for LMC>0; I just didn't find the
time to do it yet. The cached topology model is based on LIDs, so I just
need to verify that LMC>0 doesn't break anything.
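(For reference, a tiny illustration of what LMC adds - my sketch, not
patch code: with LMC=k every end port gets a block of 2^k consecutive
LIDs, and the cache would have to resolve the whole block.)

    /* Illustration only: the LID range occupied by a port with a
     * given LMC, starting at its base LID from PortInfo. */
    uint16_t lid_lo = base_lid;                   /* e.g. 0x10        */
    uint16_t lid_hi = base_lid + (1 << lmc) - 1;  /* 0x13 for lmc = 2 */
    /* every LID in [lid_lo, lid_hi] must be routable from the cache */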
I also had a more complex topology and routing model that didn't rely
on LIDs. I had what I called "Virtual LIDs": at every heavy sweep the
topology model was built and the Virtual LIDs were matched to actual
LIDs, creating a VLID <-> LID mapping, so that the cache wouldn't depend
on fabric LIDs. There I had some problems with LMC (I can't remember
what exactly), but that model proved to be useless.

> Maybe it is b/c of the future incremental routing on your todo list?
> If that's the case, instead of caching only when LMC=0, perhaps the
> initial incremental routing should only work under LMC=0.  Later on,
> incremental routing for LMC > 0 could be added.

Agreed - that is what I should eventually do.

-- Yevgeny

> Al
> 
> On Sun, 2008-05-04 at 13:08 +0300, Yevgeny Kliteynik wrote:
>> One thing I need to add here: ucast cache is currently supported
>> for LMC=0 only.
>>
>> -- Yevgeny
>>
>> Yevgeny Kliteynik wrote:
>>> Hi Sasha,
>>>
>>> The following series of 4 patches implements unicast routing cache
>>> in OpenSM.
>>>
>>> None of the current routing engines is scalable when we're talking
>>> about big clusters. On a ~5K-node cluster with ~1.3K switches, it
>>> takes about two minutes to calculate the routing. The problem is
>>> that each time the routing is calculated from scratch.
>>>
>>> Incremental routing (which is on my to-do list) aims to address this
>>> problem when there is some "local" change in the fabric (e.g. a single
>>> switch failure, a single link failure, a link added, etc.).
>>> In such cases we can reuse the routing that was already calculated in
>>> the previous heavy sweep and just modify it according to the change.
>>>
>>> For instance, if some switch has disappeared from the fabric, we can
>>> take the routing that existed with this switch, step one hop back from
>>> the switch, and see if it is possible to route all the LIDs that were
>>> routed through it some other way (which is usually the case).
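>>>
>>> Schematically, the idea is something like this (a rough sketch using
>>> the cache structures outlined further down; forwards_to() and
>>> find_alternative_port() are hypothetical helpers, not code from
>>> these patches):
>>>
>>>     /* For each switch whose cached LFT forwards a LID into the
>>>      * dead switch, try to pick another port that still reaches
>>>      * that LID; if any LID can't be re-routed, give up and do
>>>      * the full routing calculation. */
>>>     static int reroute_around_switch(cache_topo_t *topo,
>>>                                      cache_switch_t *dead_sw)
>>>     {
>>>             cache_switch_t *sw;
>>>             uint16_t lid;
>>>             uint8_t port;
>>>
>>>             for (sw = topo->switches; sw; sw = sw->next) {
>>>                     for (lid = 1; lid <= topo->max_lid; lid++) {
>>>                             if (!forwards_to(sw, lid, dead_sw))
>>>                                     continue;
>>>                             port = find_alternative_port(sw, lid, dead_sw);
>>>                             if (port == OSM_NO_PATH)
>>>                                     return -1;  /* full recalculation */
>>>                             sw->lft[lid] = port;
>>>                     }
>>>             }
>>>             return 0;
>>>     }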
>>>
>>> To implement incremental routing, we need some kind of unicast
>>> routing cache, which is what these patches implement. In addition to
>>> being a step toward incremental routing, the routing cache is useful
>>> by itself.
>>>
>>> This cache can save us a routing calculation when the change is in
>>> the leaf switches or in the hosts. For instance, if some node is
>>> rebooted, OpenSM would start a heavy sweep with a full routing
>>> recalculation when the HCA goes down, and another one when the HCA
>>> comes back up, when in fact both of these routing calculations can
>>> be replaced by using the unicast routing cache.
>>>
>>> The unicast routing cache comprises the following:
>>>  - Topology: a data structure with all the switches and CAs of the fabric
>>>  - LFTs: each switch has its LFT cached
>>>  - LID matrices: each switch has its LID matrices cached; these are
>>>    needed for multicast routing (which itself is not cached).
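>>>
>>> Roughly, the cached data could be laid out like this (my own sketch
>>> of hypothetical structures, just to make the above concrete - not
>>> the actual structures from the patches):
>>>
>>>     #include <stdint.h>
>>>     #include <complib/cl_types.h>       /* boolean_t, TRUE, FALSE */
>>>
>>>     typedef struct cache_switch {
>>>             uint64_t node_guid;
>>>             boolean_t is_leaf;          /* has CAs behind its ports */
>>>             uint8_t *lft;               /* cached LFT, indexed by LID */
>>>             uint8_t *lid_matrix;        /* cached min-hop table (mcast) */
>>>             struct cache_switch *next;
>>>     } cache_switch_t;
>>>
>>>     typedef struct cache_topo {
>>>             uint16_t max_lid;
>>>             cache_switch_t *switches;   /* all the switches... */
>>>             uint64_t *ca_guids;         /* ...and all the CAs */
>>>             unsigned num_cas;
>>>     } cache_topo_t;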
>>>
>>> There is a topology matching function that compares the current topology
>>> with the cached one to find out whether the cache is usable (valid) or not.
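>>>
>>> Conceptually the check boils down to something like this (again a
>>> simplified sketch with made-up names - fabric_t, cache_find_switch()
>>> and fabric_find_switch() are hypothetical, and the real matching
>>> must also compare port-to-port connectivity, not just node GUIDs):
>>>
>>>     /* Cache is usable iff the discovered fabric is the cached
>>>      * fabric, possibly minus some CAs and/or leaf switches. */
>>>     static boolean_t cache_valid_for(const cache_topo_t *cached,
>>>                                      const fabric_t *discovered)
>>>     {
>>>             const fabric_sw_t *sw;
>>>             const cache_switch_t *csw;
>>>
>>>             /* a switch that is new to the fabric invalidates
>>>              * the cache */
>>>             for (sw = discovered->switches; sw; sw = sw->next)
>>>                     if (!cache_find_switch(cached, sw->node_guid))
>>>                             return FALSE;
>>>
>>>             /* a cached switch that disappeared is tolerable
>>>              * only if it was a leaf */
>>>             for (csw = cached->switches; csw; csw = csw->next)
>>>                     if (!fabric_find_switch(discovered, csw->node_guid)
>>>                         && !csw->is_leaf)
>>>                             return FALSE;
>>>
>>>             /* missing CAs are always tolerable */
>>>             return TRUE;
>>>     }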
>>>
>>> The cache is used in the following way:
>>>  - the SM starts and performs the first routing calculation
>>>  - the calculated routing is stored in the cache
>>>  - at some point a new heavy sweep is triggered
>>>  - the unicast manager checks whether the cache can be used instead
>>>    of a new routing calculation.
>>>    The cached routing can be used in any of the following cases:
>>>     + there is no topology change
>>>     + one or more CAs disappeared (they exist in the cached topology
>>>       model, but are missing in the newly discovered fabric)
>>>     + one or more leaf switches disappeared
>>>    In these cases the cached routing is written to the switches as is
>>>    (skipping any switch that no longer exists).
>>>    On any other topology change:
>>>      - the existing cache is invalidated
>>>      - the new topology is cached
>>>      - routing is calculated as usual
>>>      - the calculated routing is cached
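>>>
>>> So the unicast manager's routing phase ends up looking roughly like
>>> this (schematic sketch with hypothetical names, not the patch code):
>>>
>>>     void ucast_mgr_do_routing(ucast_mgr_t *mgr)
>>>     {
>>>             if (mgr->cache && cache_valid_for(mgr->cache, mgr->fabric)) {
>>>                     /* no recalculation - just push the cached LFTs
>>>                      * to the switches that are still around */
>>>                     cache_write_lfts(mgr->cache, mgr->fabric);
>>>                     return;
>>>             }
>>>
>>>             cache_invalidate(&mgr->cache);
>>>             cache_topology(mgr);        /* cache the new topology */
>>>             calculate_routing(mgr);     /* the usual expensive path */
>>>             cache_routing(mgr);         /* cache the new routing */
>>>     }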
>>>
>>> My simulations show that where the usual routing phase of the heavy
>>> sweep on the topology mentioned above takes ~2 minutes, the cached
>>> routing reduces this time to ~6 seconds (which is nice, if you
>>> ask me...).
>>>
>>> Of all the cases where the cache is valid, the most painful and
>>> complained-about one is a compute node reboot (which happens pretty
>>> often), since it causes two heavy sweeps with two full routing
>>> calculations. The unicast routing cache is aimed at solving exactly
>>> this problem (again, in addition to being a step toward incremental
>>> routing).
>>>
>>> -- Yevgeny