[ofa-general] [RFC] Fat-Tree upgrades

Yevgeny Kliteynik kliteyn at dev.mellanox.co.il
Wed Mar 4 07:37:26 PST 2009


Hi Nicolas,

Nicolas Morey Chaisemartin wrote:
> Yevgeny Kliteynik wrote:
>> Nicolas Morey Chaisemartin wrote:
>>> Hi everyone,
>>>
>>> We have been working quite a lot at Bull lately on the Ftree algorithm
>>> and we have made some upgrades.
>>> However, as they modify the behavior of the ftree algorithm, we
>>> haven't pushed them until now.
>>> I'm just going to detail which upgrades we have done and let you
>>> decide if you are interested, if and how they should be pushed
>>> upstream (new routing algorithm, option in the ftree, etc.)
>>>
>>> Here is a simplify model of the topology we have been working on
>>>
>>>                         L3  L3
>>>       ___________________|__|____________________
>>>      /          /               \               \                <=
>>> All the L2 are connected on 2 L3 switches
>>>   L2-1         L2-2            L2-1           L2-2                 <=
>>> There are service nodes connected directly on L2 switches
>>>  /    \S1       / \S2       S3/    \       S4/   \                 <==
>>> The Nth L1  of a set leads only to the Nth L2 (L2-N). With some pruning.
>>>  L1           L1                 L1             L1
>>>  /|\         /|\                 /|\           /|\
>>> ==Fully mixed to L1==          ==Fully mixed to L1==      <=== We have
>>> multiple set. In each set, all L0 lead to all L1 of their set.
>>>
>>>   L0           L0                 L0           L0
>>> /   \        /    \             /    \       /     \
>>> CN    CN  .. CN    CN    ....   CN    CN  .. CN    CN
>>>
>>> To detail:
>>> We have a bunch of sets. Each set contains compute node, L0 and L1
>>> switches.
>>> Plus a common top of L2 and L3 switches.
>>>
>>> In each set, there are groups of compute nodes. Each group is
>>> connected to a single L0 switch.
>>> In a given set, all L0 are connected to all L1.
>>>
>>> The Nth L1 of a set is connected to the Nth L2 and only to this one.
>>> (so through a L2, the Nth L1 can only see the Nth L1 of the other sets)
>>> There are Services nodes connected to the L2 switches.
>>> All the L2 are connected to a couple of L3.
>>>
>>> The problem we have seen when routing on this topology is that most of
>>> the routes from CN to SN (service nodes) go through the L3 switches.
>>> With the current algorithm, the less loaded link is choosed to go down
>>> by going up. Therefore, the primary path goes through a L2, then a L3
>>> from where it covers all the network.
>> Not sure I understand why.
>> There are two stages of routing:
>> 1. Route the downward paths, which is done by climbing up the tree.
>> 2. Route the upward paths, which is done by descending.
>> Before each climbing, the routing first configures all the upward
>> paths: the first acll in the beginning of the
>> __osm_ftree_fabric_route_downgoing_by_going_up() function is
>> __osm_ftree_fabric_route_upgoing_by_going_down().
> 
> OK
> 
>> So in your example, if some Service Node (SN) is connected to L2-i
>> switch, the routing should first configure all the i'th switches
>> in the L1 sets, and then all the L0 switches in all the sets because
>> L1 is fully mixed to L0 in each set.
>>
>> Where am I wrong?
> 
> You're right. Routes towards the service nodes are OK.
> 
> Problem is the routes from the service node.
> 
> Let's take a CN (called C0) attached to a L0.
> Algorithm go one step up (downward route) to  L1-1 (could be any L1 in its set), covers all the upward routes so we have all the routes CN->C0 where the CN belong to the same set.
> Next step: Algorithm doesn't have a choice here, it goes to L2-1, covers the upward routes so we have all the CN->C0 routes and S1->C0
> However we don't have any routes from S[234]->C0 yet.
> Next and final step, algorithm go to either one of the L3, covers upwards routes and create routes from S[234]->C0. These routes have more hops than the minimum possible.

OK, I get it now.
So there's is a general issue with the existing algorithm:
in topologies with redundant switch level(s), such as L3 in
your example, fat-tree might create switch-to-CN routes (or
in your example, SN-to-CN when SN is on a higher tree level
than CN) that are longer than min-hop.

Currently, when the algorithm goes up, it checks whether or
not the upper switch already has this LID in the LFT, and
if it has - the route is abandoned.

What you're suggesting is to check the path length, and if
the new path is shorter, then it should replace the old one.
Sounds like a great idea.

>>> This wasn't acceptable for us as L3 switches would be overloaded when
>>> there were less loaded/shorter paths to achieve the same HCA.
>>>
>>> So what we have introduced here is a "balanced min_hop" within the
>>> ftree algorithm.
>>> Basically, instead of just leaving when we reach a LFT which has
>>> already been configured for the target lid, we check the hops count of
>>> the switch toward this lid, and the hop count on the path we came
>>> through. If we have found a shorter path, we update the LFT and minhop
>>> tables to use this new path.

And what do you do after that? Do you continue climbing
the tree and checking the route length? I guess you need
to continue climbing till all the upper switches will
have LFT entries with route length equal to min-hop.
I do have some concerns about the CN-to-CN routes in this
case. Any chance these routes might be affected?

My intuition says that they won't, because they are required
to reside at the same tree level, and since there is full
connectivity between them, they will be all configured with
the shortest path. But I'm not done convincing myself...

-- Yevgeny

>> I don't think that fat-tree routing was supposed to configure a route
>> with more hops that min-hop would do it. Either I don't understand the
>> example, or there is a bug in the routing.
> 
> If it isn't suppose to, then it's a bug :) 
> However this bug only happens with nodes not at the same level as compute nodes.
>> -- Yevgeny
>>
> 




More information about the general mailing list