[ofa-general] [RFC] Fat-Tree upgrades

Sun Feb 8 23:40:05 PST 2009

Hi everyone,

We have been working quite a lot at Bull lately on the Ftree algorithm 
and we have made some upgrades.
However, as they modify the behavior of the ftree algorithm, we haven't 
pushed them until now.
I'm just going to detail which upgrades we have done and let you decide 
if you are interested, if and how they should be pushed upstream (new 
routing algorithm, option in the ftree, etc.)

Here is a simplify model of the topology we have been working on

                         L3  L3
       ___________________|__|____________________
      /          /               \               \                <= All 
the L2 are connected on 2 L3 switches
   L2-1         L2-2            L2-1           L2-2                 <= 
There are service nodes connected directly on L2 switches
  /    \S1       / \S2       S3/    \       S4/   \                 <== 
The Nth L1  of a set leads only to the Nth L2 (L2-N). With some pruning.
  L1           L1                 L1             L1
  /|\         /|\                 /|\           /|\
 ==Fully mixed to L1==          ==Fully mixed to L1==      <=== We have 
multiple set. In each set, all L0 lead to all L1 of their set.

   L0           L0                 L0           L0
 /   \        /    \             /    \       /     \
CN    CN  .. CN    CN    ....   CN    CN  .. CN    CN

To detail:
We have a bunch of sets. Each set contains compute node, L0 and L1 
switches.
Plus a common top of L2 and L3 switches.

In each set, there are groups of compute nodes. Each group is connected 
to a single L0 switch.
In a given set, all L0 are connected to all L1.

The Nth L1 of a set is connected to the Nth L2 and only to this one. (so 
through a L2, the Nth L1 can only see the Nth L1 of the other sets)
There are Services nodes connected to the L2 switches.
All the L2 are connected to a couple of L3.

The problem we have seen when routing on this topology is that most of 
the routes from CN to SN (service nodes) go through the L3 switches. 
With the current algorithm, the less loaded link is choosed to go down 
by going up. Therefore, the primary path goes through a L2, then a L3 
from where it covers all the network.
This wasn't acceptable for us as L3 switches would be overloaded when 
there were less loaded/shorter paths to achieve the same HCA.

So what we have introduced here is a "balanced min_hop" within the ftree 
algorithm.
Basically, instead of just leaving when we reach a LFT which has already 
been configured for the target lid, we check the hops count of the 
switch toward this lid, and the hop count on the path we came through. 
If we have found a shorter path, we update the LFT and minhop tables to 
use this new path.
This means that the difference between primary_path and secondary path 
is not so important anymore.
Secondary path may increment port counters but only if routes to HCA 
were created (see opensm/osm_ucast_ftree.c: Fixed bug on index    port 
incrementation which makes this possible).
I acknowledge that port count may be slightly wrong as a primary path 
that is replaced with a shorter secondary path has incremented counters 
and they won't be removed. However, in most the cases the primary path 
would have created other routes than the one replaced so counters are fine.

For all regular ftree topology, I have see no change with this update 
but with topologies where two levels are not fully interconnected, this 
helps a lot !

Another thing we have developped here is to balance more secondary path.
In the current algorithm, secondary down path (going_down_by_going up) 
are created in port_group order.
This means that if the primary path didn't reach all the network 
(because a switch is broken for examples), all the routes missing will 
be created through the first port group. Which unbalance the network 
load a lot.
To solve this,  we create the secondary path by port group load.
The previous patch has made us increment the port/portgroup counters  
when secondary routes towards HCA are created, therefore these counters 
are significant even when creating secondary routes.
What our patch does is at the beginning of the function sort all the 
port group from lowest load to highest. Pick the first one for the 
primary path, and try secondary path from the 2nd to the last.
Once again this seems to have no effect on regular topology but it made 
a real impact on our failover tests.

Feel free to comment this, and more specially if and how you would want 
them upstream.

Thanks in advance

Nicolas