[ofa-general] [RFC] Fat-Tree upgrades

Nicolas Morey-Chaisemartin devel at morey-chaisemartin.com
Wed Mar 4 12:26:51 PST 2009


Yevgeny Kliteynik a écrit :
> Hi Nicolas,
> 
> Nicolas Morey Chaisemartin wrote:
>> Yevgeny Kliteynik wrote:
>>> Nicolas Morey Chaisemartin wrote:
>>>> Hi everyone,
>>>>
>>>> We have been working quite a lot at Bull lately on the Ftree algorithm
>>>> and we have made some upgrades.
>>>> However, as they modify the behavior of the ftree algorithm, we
>>>> haven't pushed them until now.
>>>> I'm just going to detail which upgrades we have done and let you
>>>> decide if you are interested, if and how they should be pushed
>>>> upstream (new routing algorithm, option in the ftree, etc.)
>>>>
>>>> Here is a simplify model of the topology we have been working on
>>>>
>>>>                         L3  L3
>>>>       ___________________|__|____________________
>>>>      /          /               \               \                <=
>>>> All the L2 are connected on 2 L3 switches
>>>>   L2-1         L2-2            L2-1           L2-2                 <=
>>>> There are service nodes connected directly on L2 switches
>>>>  /    \S1       / \S2       S3/    \       S4/   \                 <==
>>>> The Nth L1  of a set leads only to the Nth L2 (L2-N). With some
>>>> pruning.
>>>>  L1           L1                 L1             L1
>>>>  /|\         /|\                 /|\           /|\
>>>> ==Fully mixed to L1==          ==Fully mixed to L1==      <=== We have
>>>> multiple set. In each set, all L0 lead to all L1 of their set.
>>>>
>>>>   L0           L0                 L0           L0
>>>> /   \        /    \             /    \       /     \
>>>> CN    CN  .. CN    CN    ....   CN    CN  .. CN    CN
>>>>
>>>> To detail:
>>>> We have a bunch of sets. Each set contains compute node, L0 and L1
>>>> switches.
>>>> Plus a common top of L2 and L3 switches.
>>>>
>>>> In each set, there are groups of compute nodes. Each group is
>>>> connected to a single L0 switch.
>>>> In a given set, all L0 are connected to all L1.
>>>>
>>>> The Nth L1 of a set is connected to the Nth L2 and only to this one.
>>>> (so through a L2, the Nth L1 can only see the Nth L1 of the other sets)
>>>> There are Services nodes connected to the L2 switches.
>>>> All the L2 are connected to a couple of L3.
>>>>
>>>> The problem we have seen when routing on this topology is that most of
>>>> the routes from CN to SN (service nodes) go through the L3 switches.
>>>> With the current algorithm, the less loaded link is choosed to go down
>>>> by going up. Therefore, the primary path goes through a L2, then a L3
>>>> from where it covers all the network.
>>> Not sure I understand why.
>>> There are two stages of routing:
>>> 1. Route the downward paths, which is done by climbing up the tree.
>>> 2. Route the upward paths, which is done by descending.
>>> Before each climbing, the routing first configures all the upward
>>> paths: the first acll in the beginning of the
>>> __osm_ftree_fabric_route_downgoing_by_going_up() function is
>>> __osm_ftree_fabric_route_upgoing_by_going_down().
>>
>> OK
>>
>>> So in your example, if some Service Node (SN) is connected to L2-i
>>> switch, the routing should first configure all the i'th switches
>>> in the L1 sets, and then all the L0 switches in all the sets because
>>> L1 is fully mixed to L0 in each set.
>>>
>>> Where am I wrong?
>>
>> You're right. Routes towards the service nodes are OK.
>>
>> Problem is the routes from the service node.
>>
>> Let's take a CN (called C0) attached to a L0.
>> Algorithm go one step up (downward route) to  L1-1 (could be any L1 in
>> its set), covers all the upward routes so we have all the routes
>> CN->C0 where the CN belong to the same set.
>> Next step: Algorithm doesn't have a choice here, it goes to L2-1,
>> covers the upward routes so we have all the CN->C0 routes and S1->C0
>> However we don't have any routes from S[234]->C0 yet.
>> Next and final step, algorithm go to either one of the L3, covers
>> upwards routes and create routes from S[234]->C0. These routes have
>> more hops than the minimum possible.
> 
> OK, I get it now.
> So there's is a general issue with the existing algorithm:
> in topologies with redundant switch level(s), such as L3 in
> your example, fat-tree might create switch-to-CN routes (or
> in your example, SN-to-CN when SN is on a higher tree level
> than CN) that are longer than min-hop.
> 
> Currently, when the algorithm goes up, it checks whether or
> not the upper switch already has this LID in the LFT, and
> if it has - the route is abandoned.
> 
> What you're suggesting is to check the path length, and if
> the new path is shorter, then it should replace the old one.
> Sounds like a great idea.
> 
Thanks

>>>> This wasn't acceptable for us as L3 switches would be overloaded when
>>>> there were less loaded/shorter paths to achieve the same HCA.
>>>>
>>>> So what we have introduced here is a "balanced min_hop" within the
>>>> ftree algorithm.
>>>> Basically, instead of just leaving when we reach a LFT which has
>>>> already been configured for the target lid, we check the hops count of
>>>> the switch toward this lid, and the hop count on the path we came
>>>> through. If we have found a shorter path, we update the LFT and minhop
>>>> tables to use this new path.
> 
> And what do you do after that? Do you continue climbing
> the tree and checking the route length?
 I guess you need
> to continue climbing till all the upper switches will
> have LFT entries with route length equal to min-hop.


It depends on what you call climbing.
On its way up (creating downward routes) you can't find a shorter path (as long as a switch only belongs to one level).
On your way down (upward routes) you have to carry to propagate the new shorter path.


> I do have some concerns about the CN-to-CN routes in this
> case. Any chance these routes might be affected?
> 
> My intuition says that they won't, because they are required
> to reside at the same tree level, and since there is full
> connectivity between them, they will be all configured with
> the shortest path. But I'm not done convincing myself...
> 


Yes. On irregular topology (switch/link loss) you may see such a thing.

If we take our example again.
Imagine the link between L2-1 and the first L1 of the second set broke down.
We take the same way I describe earlier (C0-->L1-1->L2-1)
As your link broke down you can't reach the second set from here so the main route will go through L3.
However using the solution I proposed earlier, another route will be created through another L2.

But imho, it's a good thing.

The important idea coming with this fix is that there isn't much difference anymore between primary path and secondary paths. A primary path is just the first path to be explore while the secondary is a later one.
On regular topology, the primary path will always be used.
However, as soon as your topology gets irregular (which on big networks due to link/switch problems may be quite often), the effects of the minhop start kicking in, and according to our tests we get _much better_ results.

To achieve this, more than testing if your path is shorter, you need to equilibrate the secondary routes so they are exlored in the right order. If your first secondary route is a really loaded one, the minhop may create a new route going through here which is a bad thing.
What I have done to fix this is instead of finding the lowest loaded port group (primary) and then explore the others (secondaries), I sort  them all and explore them in sorted order.

To test these algorithms we have written some scripts.
We basically calculate the number of routes (CN to CN) per link on the whole network. The better the algorithm is, the less point you should have (on regular topology, theoretical points is about one per level). 
We mesure this for each switch/link loss using ibsim/opensm and aggregate all the results. 
Using the minhop and sorted secondary routes, the behaviour of the algorithm is totally changed, in well :)

I'm also working on some more enhancements and speed upgrades (sorting costs a bit of time but I've managed to reduce the cost a lot). I'll present them when they are ready.

Anyway if you feel like this idea could be integrated in the ftree algorithm, I have patches working, I'll just need to repack them nicely and send them.

Nicolas



More information about the general mailing list