[ofa-general] [RFC] Fat-Tree upgrades
Nicolas Morey Chaisemartin
nicolas.morey-chaisemartin at ext.bull.net
Sun Feb 8 23:40:05 PST 2009
Hi everyone,
We have been working quite a lot at Bull lately on the Ftree algorithm
and we have made some upgrades.
However, as they modify the behavior of the ftree algorithm, we haven't
pushed them until now.
I'm just going to detail which upgrades we have done and let you decide
if you are interested, if and how they should be pushed upstream (new
routing algorithm, option in the ftree, etc.)
Here is a simplify model of the topology we have been working on
L3 L3
___________________|__|____________________
/ / \ \ <= All
the L2 are connected on 2 L3 switches
L2-1 L2-2 L2-1 L2-2 <=
There are service nodes connected directly on L2 switches
/ \S1 / \S2 S3/ \ S4/ \ <==
The Nth L1 of a set leads only to the Nth L2 (L2-N). With some pruning.
L1 L1 L1 L1
/|\ /|\ /|\ /|\
==Fully mixed to L1== ==Fully mixed to L1== <=== We have
multiple set. In each set, all L0 lead to all L1 of their set.
L0 L0 L0 L0
/ \ / \ / \ / \
CN CN .. CN CN .... CN CN .. CN CN
To detail:
We have a bunch of sets. Each set contains compute node, L0 and L1
switches.
Plus a common top of L2 and L3 switches.
In each set, there are groups of compute nodes. Each group is connected
to a single L0 switch.
In a given set, all L0 are connected to all L1.
The Nth L1 of a set is connected to the Nth L2 and only to this one. (so
through a L2, the Nth L1 can only see the Nth L1 of the other sets)
There are Services nodes connected to the L2 switches.
All the L2 are connected to a couple of L3.
The problem we have seen when routing on this topology is that most of
the routes from CN to SN (service nodes) go through the L3 switches.
With the current algorithm, the less loaded link is choosed to go down
by going up. Therefore, the primary path goes through a L2, then a L3
from where it covers all the network.
This wasn't acceptable for us as L3 switches would be overloaded when
there were less loaded/shorter paths to achieve the same HCA.
So what we have introduced here is a "balanced min_hop" within the ftree
algorithm.
Basically, instead of just leaving when we reach a LFT which has already
been configured for the target lid, we check the hops count of the
switch toward this lid, and the hop count on the path we came through.
If we have found a shorter path, we update the LFT and minhop tables to
use this new path.
This means that the difference between primary_path and secondary path
is not so important anymore.
Secondary path may increment port counters but only if routes to HCA
were created (see opensm/osm_ucast_ftree.c: Fixed bug on index port
incrementation which makes this possible).
I acknowledge that port count may be slightly wrong as a primary path
that is replaced with a shorter secondary path has incremented counters
and they won't be removed. However, in most the cases the primary path
would have created other routes than the one replaced so counters are fine.
For all regular ftree topology, I have see no change with this update
but with topologies where two levels are not fully interconnected, this
helps a lot !
Another thing we have developped here is to balance more secondary path.
In the current algorithm, secondary down path (going_down_by_going up)
are created in port_group order.
This means that if the primary path didn't reach all the network
(because a switch is broken for examples), all the routes missing will
be created through the first port group. Which unbalance the network
load a lot.
To solve this, we create the secondary path by port group load.
The previous patch has made us increment the port/portgroup counters
when secondary routes towards HCA are created, therefore these counters
are significant even when creating secondary routes.
What our patch does is at the beginning of the function sort all the
port group from lowest load to highest. Pick the first one for the
primary path, and try secondary path from the 2nd to the last.
Once again this seems to have no effect on regular topology but it made
a real impact on our failover tests.
Feel free to comment this, and more specially if and how you would want
them upstream.
Thanks in advance
Nicolas
More information about the general
mailing list