[ofa-general] [RFC] Fat-Tree upgrades

Yevgeny Kliteynik kliteyn at dev.mellanox.co.il
Wed Mar 4 02:15:27 PST 2009


Nicolas Morey Chaisemartin wrote:
> Hi everyone,
> 
> We have been working quite a lot at Bull lately on the Ftree algorithm 
> and we have made some upgrades.
> However, as they modify the behavior of the ftree algorithm, we haven't 
> pushed them until now.
> I'm just going to detail which upgrades we have done and let you decide 
> if you are interested, if and how they should be pushed upstream (new 
> routing algorithm, option in the ftree, etc.)
> 
> Here is a simplify model of the topology we have been working on
> 
>                         L3  L3
>       ___________________|__|____________________
>      /          /               \               \                <= All 
> the L2 are connected on 2 L3 switches
>   L2-1         L2-2            L2-1           L2-2                 <= 
> There are service nodes connected directly on L2 switches
>  /    \S1       / \S2       S3/    \       S4/   \                 <== 
> The Nth L1  of a set leads only to the Nth L2 (L2-N). With some pruning.
>  L1           L1                 L1             L1
>  /|\         /|\                 /|\           /|\
> ==Fully mixed to L1==          ==Fully mixed to L1==      <=== We have 
> multiple set. In each set, all L0 lead to all L1 of their set.
> 
>   L0           L0                 L0           L0
> /   \        /    \             /    \       /     \
> CN    CN  .. CN    CN    ....   CN    CN  .. CN    CN
> 
> To detail:
> We have a bunch of sets. Each set contains compute node, L0 and L1 
> switches.
> Plus a common top of L2 and L3 switches.
> 
> In each set, there are groups of compute nodes. Each group is connected 
> to a single L0 switch.
> In a given set, all L0 are connected to all L1.
> 
> The Nth L1 of a set is connected to the Nth L2 and only to this one. (so 
> through a L2, the Nth L1 can only see the Nth L1 of the other sets)
> There are Services nodes connected to the L2 switches.
> All the L2 are connected to a couple of L3.
> 
> The problem we have seen when routing on this topology is that most of 
> the routes from CN to SN (service nodes) go through the L3 switches. 
> With the current algorithm, the less loaded link is choosed to go down 
> by going up. Therefore, the primary path goes through a L2, then a L3 
> from where it covers all the network.

Not sure I understand why.
There are two stages of routing:
1. Route the downward paths, which is done by climbing up the tree.
2. Route the upward paths, which is done by descending.
Before each climbing, the routing first configures all the upward
paths: the first acll in the beginning of the 
__osm_ftree_fabric_route_downgoing_by_going_up() function is 
__osm_ftree_fabric_route_upgoing_by_going_down().

So in your example, if some Service Node (SN) is connected to 
L2-i switch, the routing should first configure all the i'th switches
in the L1 sets, and then all the L0 switches in all the sets because
L1 is fully mixed to L0 in each set.

Where am I wrong? 


> This wasn't acceptable for us as L3 switches would be overloaded when 
> there were less loaded/shorter paths to achieve the same HCA.
> 
> So what we have introduced here is a "balanced min_hop" within the ftree 
> algorithm.
> Basically, instead of just leaving when we reach a LFT which has already 
> been configured for the target lid, we check the hops count of the 
> switch toward this lid, and the hop count on the path we came through. 
> If we have found a shorter path, we update the LFT and minhop tables to 
> use this new path.

I don't think that fat-tree routing was supposed to configure a route
with more hops that min-hop would do it. Either I don't understand 
the example, or there is a bug in the routing.

-- Yevgeny

> This means that the difference between primary_path and secondary path 
> is not so important anymore.
> Secondary path may increment port counters but only if routes to HCA 
> were created (see opensm/osm_ucast_ftree.c: Fixed bug on index    port 
> incrementation which makes this possible).
> I acknowledge that port count may be slightly wrong as a primary path 
> that is replaced with a shorter secondary path has incremented counters 
> and they won't be removed. However, in most the cases the primary path 
> would have created other routes than the one replaced so counters are fine.
> 
> For all regular ftree topology, I have see no change with this update 
> but with topologies where two levels are not fully interconnected, this 
> helps a lot !
> 
> 
> 
> Another thing we have developped here is to balance more secondary path.
> In the current algorithm, secondary down path (going_down_by_going up) 
> are created in port_group order.
> This means that if the primary path didn't reach all the network 
> (because a switch is broken for examples), all the routes missing will 
> be created through the first port group. Which unbalance the network 
> load a lot.
> To solve this,  we create the secondary path by port group load.
> The previous patch has made us increment the port/portgroup counters  
> when secondary routes towards HCA are created, therefore these counters 
> are significant even when creating secondary routes.
> What our patch does is at the beginning of the function sort all the 
> port group from lowest load to highest. Pick the first one for the 
> primary path, and try secondary path from the 2nd to the last.
> Once again this seems to have no effect on regular topology but it made 
> a real impact on our failover tests.
> 
> 
> 
> Feel free to comment this, and more specially if and how you would want 
> them upstream.
> 
> Thanks in advance
> 
> Nicolas
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit 
> http://openib.org/mailman/listinfo/openib-general
> 




More information about the general mailing list