[ofa-general] [RFC] Fat-Tree upgrades

Yevgeny Kliteynik kliteyn at dev.mellanox.co.il
Thu Mar 5 01:04:26 PST 2009


Hi Nicolas,

Nicolas Morey-Chaisemartin wrote:

>>>>> So what we have introduced here is a "balanced min_hop" within the
>>>>> ftree algorithm.
>>>>> Basically, instead of just leaving when we reach a LFT which has
>>>>> already been configured for the target lid, we check the hops count of
>>>>> the switch toward this lid, and the hop count on the path we came
>>>>> through. If we have found a shorter path, we update the LFT and minhop
>>>>> tables to use this new path.
>> And what do you do after that? Do you continue climbing
>> the tree and checking the route length?
>  I guess you need
>> to continue climbing till all the upper switches will
>> have LFT entries with route length equal to min-hop.
> 
> 
> It depends on what you call climbing.
> On its way up (creating downward routes) you can't find a shorter path (as long as a switch only belongs to one level).
> On your way down (upward routes) you have to carry to propagate the new shorter path.

Not exactly. Even on your way up (while creating downward routes)
you can find shorter paths than what was configured by main path.
It's hard to explain in words, so I draw some ppt slides (see the
attachement).
 
>> I do have some concerns about the CN-to-CN routes in this
>> case. Any chance these routes might be affected?
>>
>> My intuition says that they won't, because they are required
>> to reside at the same tree level, and since there is full
>> connectivity between them, they will be all configured with
>> the shortest path. But I'm not done convincing myself...
>>
> 
> Yes. On irregular topology (switch/link loss) you may see such a thing.
> 
> If we take our example again.
> Imagine the link between L2-1 and the first L1 of the second set broke down.
> We take the same way I describe earlier (C0-->L1-1->L2-1)
> As your link broke down you can't reach the second set from here so the main route will go through L3.
> However using the solution I proposed earlier, another route will be created through another L2.
> 
> But imho, it's a good thing.

Agree.
 
> The important idea coming with this fix is that there isn't much difference anymore between primary path and secondary paths. A primary path is just the first path to be explore while the secondary is a later one.
> On regular topology, the primary path will always be used.
> However, as soon as your topology gets irregular (which on big networks due to link/switch problems may be quite often), the effects of the minhop start kicking in, and according to our tests we get _much better_ results.

Great :)

> To achieve this, more than testing if your path is shorter, you need to equilibrate the secondary routes so they are exlored in the right order. If your first secondary route is a really loaded one, the minhop may create a new route going through here which is a bad thing.

Right, and since the whole point of this min-hop in fat-tree kicks
in only in an irregular or faulty topologies, you will have such
loaded links by definition.

> What I have done to fix this is instead of finding the lowest loaded port group (primary) and then explore the others (secondaries), I sort  them all and explore them in sorted order.

Sounds great
 
> To test these algorithms we have written some scripts.
> We basically calculate the number of routes (CN to CN) per link on the whole network.
> The better the algorithm is, the less point you should have (on regular topology, theoretical points is about one per level). 
> We mesure this for each switch/link loss using ibsim/opensm and aggregate all the results. 
> Using the minhop and sorted secondary routes, the behaviour of the algorithm is totally changed, in well :)
> 
> I'm also working on some more enhancements and speed upgrades (sorting costs a bit of time but I've managed to reduce the cost a lot). I'll present them when they are ready.

The speed penalty was my other concern.
Will be happy to see the comparison of the calculation time with and w/o the sorting once its ready.

> Anyway if you feel like this idea could be integrated in the ftree algorithm, I have patches working, I'll just need to repack them nicely and send them.

Sure, this is a very important addition to the fat-tree routing.

-- Yevgeny

 
> Nicolas
> 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: fat-tree.ppt
Type: application/vnd.ms-powerpoint
Size: 72704 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090305/60cb8afa/attachment.ppt>


More information about the general mailing list