[ofa-general] [RFC] Fat-Tree upgrades
Nicolas Morey Chaisemartin
nicolas.morey-chaisemartin at ext.bull.net
Wed Mar 4 03:45:56 PST 2009
Yevgeny Kliteynik wrote:
> Nicolas Morey Chaisemartin wrote:
>> Hi everyone,
>> We have been working quite a lot at Bull lately on the Ftree algorithm
>> and we have made some upgrades.
>> However, as they modify the behavior of the ftree algorithm, we
>> haven't pushed them until now.
>> I'm just going to detail which upgrades we have done and let you
>> decide if you are interested, if and how they should be pushed
>> upstream (new routing algorithm, option in the ftree, etc.)
>> Here is a simplify model of the topology we have been working on
>> L3 L3
>> / / \ \ <=
>> All the L2 are connected on 2 L3 switches
>> L2-1 L2-2 L2-1 L2-2 <=
>> There are service nodes connected directly on L2 switches
>> / \S1 / \S2 S3/ \ S4/ \ <==
>> The Nth L1 of a set leads only to the Nth L2 (L2-N). With some pruning.
>> L1 L1 L1 L1
>> /|\ /|\ /|\ /|\
>> ==Fully mixed to L1== ==Fully mixed to L1== <=== We have
>> multiple set. In each set, all L0 lead to all L1 of their set.
>> L0 L0 L0 L0
>> / \ / \ / \ / \
>> CN CN .. CN CN .... CN CN .. CN CN
>> To detail:
>> We have a bunch of sets. Each set contains compute node, L0 and L1
>> Plus a common top of L2 and L3 switches.
>> In each set, there are groups of compute nodes. Each group is
>> connected to a single L0 switch.
>> In a given set, all L0 are connected to all L1.
>> The Nth L1 of a set is connected to the Nth L2 and only to this one.
>> (so through a L2, the Nth L1 can only see the Nth L1 of the other sets)
>> There are Services nodes connected to the L2 switches.
>> All the L2 are connected to a couple of L3.
>> The problem we have seen when routing on this topology is that most of
>> the routes from CN to SN (service nodes) go through the L3 switches.
>> With the current algorithm, the less loaded link is choosed to go down
>> by going up. Therefore, the primary path goes through a L2, then a L3
>> from where it covers all the network.
> Not sure I understand why.
> There are two stages of routing:
> 1. Route the downward paths, which is done by climbing up the tree.
> 2. Route the upward paths, which is done by descending.
> Before each climbing, the routing first configures all the upward
> paths: the first acll in the beginning of the
> __osm_ftree_fabric_route_downgoing_by_going_up() function is
> So in your example, if some Service Node (SN) is connected to L2-i
> switch, the routing should first configure all the i'th switches
> in the L1 sets, and then all the L0 switches in all the sets because
> L1 is fully mixed to L0 in each set.
> Where am I wrong?
You're right. Routes towards the service nodes are OK.
Problem is the routes from the service node.
Let's take a CN (called C0) attached to a L0.
Algorithm go one step up (downward route) to L1-1 (could be any L1 in its set), covers all the upward routes so we have all the routes CN->C0 where the CN belong to the same set.
Next step: Algorithm doesn't have a choice here, it goes to L2-1, covers the upward routes so we have all the CN->C0 routes and S1->C0
However we don't have any routes from S->C0 yet.
Next and final step, algorithm go to either one of the L3, covers upwards routes and create routes from S->C0. These routes have more hops than the minimum possible.
>> This wasn't acceptable for us as L3 switches would be overloaded when
>> there were less loaded/shorter paths to achieve the same HCA.
>> So what we have introduced here is a "balanced min_hop" within the
>> ftree algorithm.
>> Basically, instead of just leaving when we reach a LFT which has
>> already been configured for the target lid, we check the hops count of
>> the switch toward this lid, and the hop count on the path we came
>> through. If we have found a shorter path, we update the LFT and minhop
>> tables to use this new path.
> I don't think that fat-tree routing was supposed to configure a route
> with more hops that min-hop would do it. Either I don't understand the
> example, or there is a bug in the routing.
If it isn't suppose to, then it's a bug :)
However this bug only happens with nodes not at the same level as compute nodes.
> -- Yevgeny
More information about the general