[ofa-general] [RFC] Fat-Tree upgrades

Wed Mar 4 03:45:56 PST 2009

Yevgeny Kliteynik wrote:
> Nicolas Morey Chaisemartin wrote:
>> Hi everyone,
>>
>> We have been working quite a lot at Bull lately on the Ftree algorithm
>> and we have made some upgrades.
>> However, as they modify the behavior of the ftree algorithm, we
>> haven't pushed them until now.
>> I'm just going to detail which upgrades we have done and let you
>> decide if you are interested, if and how they should be pushed
>> upstream (new routing algorithm, option in the ftree, etc.)
>>
>> Here is a simplify model of the topology we have been working on
>>
>>                         L3  L3
>>       ___________________|__|____________________
>>      /          /               \               \                <=
>> All the L2 are connected on 2 L3 switches
>>   L2-1         L2-2            L2-1           L2-2                 <=
>> There are service nodes connected directly on L2 switches
>>  /    \S1       / \S2       S3/    \       S4/   \                 <==
>> The Nth L1  of a set leads only to the Nth L2 (L2-N). With some pruning.
>>  L1           L1                 L1             L1
>>  /|\         /|\                 /|\           /|\
>> ==Fully mixed to L1==          ==Fully mixed to L1==      <=== We have
>> multiple set. In each set, all L0 lead to all L1 of their set.
>>
>>   L0           L0                 L0           L0
>> /   \        /    \             /    \       /     \
>> CN    CN  .. CN    CN    ....   CN    CN  .. CN    CN
>>
>> To detail:
>> We have a bunch of sets. Each set contains compute node, L0 and L1
>> switches.
>> Plus a common top of L2 and L3 switches.
>>
>> In each set, there are groups of compute nodes. Each group is
>> connected to a single L0 switch.
>> In a given set, all L0 are connected to all L1.
>>
>> The Nth L1 of a set is connected to the Nth L2 and only to this one.
>> (so through a L2, the Nth L1 can only see the Nth L1 of the other sets)
>> There are Services nodes connected to the L2 switches.
>> All the L2 are connected to a couple of L3.
>>
>> The problem we have seen when routing on this topology is that most of
>> the routes from CN to SN (service nodes) go through the L3 switches.
>> With the current algorithm, the less loaded link is choosed to go down
>> by going up. Therefore, the primary path goes through a L2, then a L3
>> from where it covers all the network.
> 
> Not sure I understand why.
> There are two stages of routing:
> 1. Route the downward paths, which is done by climbing up the tree.
> 2. Route the upward paths, which is done by descending.
> Before each climbing, the routing first configures all the upward
> paths: the first acll in the beginning of the
> __osm_ftree_fabric_route_downgoing_by_going_up() function is
> __osm_ftree_fabric_route_upgoing_by_going_down().

OK

> 
> So in your example, if some Service Node (SN) is connected to L2-i
> switch, the routing should first configure all the i'th switches
> in the L1 sets, and then all the L0 switches in all the sets because
> L1 is fully mixed to L0 in each set.
> 
> Where am I wrong?

You're right. Routes towards the service nodes are OK.

Problem is the routes from the service node.

Let's take a CN (called C0) attached to a L0.
Algorithm go one step up (downward route) to  L1-1 (could be any L1 in its set), covers all the upward routes so we have all the routes CN->C0 where the CN belong to the same set.
Next step: Algorithm doesn't have a choice here, it goes to L2-1, covers the upward routes so we have all the CN->C0 routes and S1->C0
However we don't have any routes from S[234]->C0 yet.
Next and final step, algorithm go to either one of the L3, covers upwards routes and create routes from S[234]->C0. These routes have more hops than the minimum possible.
> 
>> This wasn't acceptable for us as L3 switches would be overloaded when
>> there were less loaded/shorter paths to achieve the same HCA.
>>
>> So what we have introduced here is a "balanced min_hop" within the
>> ftree algorithm.
>> Basically, instead of just leaving when we reach a LFT which has
>> already been configured for the target lid, we check the hops count of
>> the switch toward this lid, and the hop count on the path we came
>> through. If we have found a shorter path, we update the LFT and minhop
>> tables to use this new path.
> 
> I don't think that fat-tree routing was supposed to configure a route
> with more hops that min-hop would do it. Either I don't understand the
> example, or there is a bug in the routing.

If it isn't suppose to, then it's a bug :) 
However this bug only happens with nodes not at the same level as compute nodes.
> 
> -- Yevgeny
>