[ofa-general] [PATCH v3] opensm/osm_ucast_ftree.c: Fixed bug on index port incrementation

Nicolas Morey Chaisemartin nicolas.morey-chaisemartin at ext.bull.net
Mon Feb 9 07:55:46 PST 2009


This patch fixes a bug in the way the down port group index is incremented in the fat-tree routing algorithm.
The problem shows up (at least) with a 4-level fat tree such as the one below:


                          L3  L3
        ___________________|__|____________________
       /          /               \               \                <= All the L2 are connected to 2 L3 switches
    L2-1         L2-2            L2-1           L2-2
   /             /                 \              \                 <== The Nth L1  of a set leads only to the Nth L2 (L2-N). With some pruning.
   L1           L1                 L1             L1
   /|\         /|\                 /|\           /|\
  ==Fully mixed to L1==          ==Fully mixed to L1==      <=== We have multiple sets. In each set, all L0 lead to all L1 of their set.

    L0           L0                 L0           L0
  /   \        /    \             /    \       /     \
CN    CN  .. CN    CN    ....   CN    CN  .. CN    CN


To detail:
We have several sets. Each set contains compute nodes, L0 switches and L1 switches.
On top of the sets sits a common layer of L2 and L3 switches.

In each set, there are groups of compute nodes. Each group is connected to a single L0 switch.
In a given set, all L0 are connected to all L1.

The Nth L1 of a set is connected to the Nth L2 and only to that one (so through an L2, the Nth L1 can only see the Nth L1 of the other sets).
All the L2 switches are connected to a pair of L3 switches.


If we leave out the L3 switches, we have a perfectly balanced fat tree with well-balanced routes.
But adding the L3 switches makes a huge difference. Since they are not needed, no route goes through L3 (which is fine).
However, 1/4 of the L2->L1 routes are not used at all, 1/2 are half used and 1/4 are used twice as much as in the balanced state.

This comes from down_port_groups_idx, which is incremented each time the algorithm goes down through a node, whether or not it creates routes to an HCA (port != switch).
As a route coming up from an L1 reaches only one L2, the algorithm goes through all the other L2 switches while going down, incrementing their index.
Our case is a bit specific, but the problem may appear whenever the L1 switches don't have full connectivity to all the L2 switches and there is another switch rank above.

To avoid this problem, the __osm_ftree_fabric_route_upgoing_by_going_down function now returns a value indicating whether routes to an HCA (in fact, to a leaf switch) were created.
With this information, we only increment the index when the algorithm has created routes to an HCA.
After applying this patch and measuring link usage, the routing is perfectly balanced (the L2<->L3 links are still not used, but that is to be expected).
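
To illustrate the idea only (this is not the OpenSM code and not the attached diff), here is a toy model of the index problem. Everything in it is hypothetical: struct toy_switch, toy_route_down(), toy_run() and the alternating visit pattern are made up for the example. A "switch" has 4 down port groups and a round-robin index. Every second descent through it is assumed to be unproductive (it creates no route toward a leaf switch).

```c
#include <stdbool.h>
#include <string.h>

#define NUM_GROUPS 4
#define NUM_VISITS 8

/* Hypothetical stand-in for a switch: a round-robin index over its
 * down port groups, plus a usage counter per group. */
struct toy_switch {
	unsigned down_idx;
	unsigned usage[NUM_GROUPS];
};

/* Hypothetical stand-in for one descent through the switch.
 * Returns true only if routes toward a leaf switch were created,
 * which is the kind of information the patch makes available. */
static bool toy_route_down(struct toy_switch *sw, bool reaches_leaf)
{
	if (!reaches_leaf)
		return false;	/* nothing routed, no port group used */
	sw->usage[sw->down_idx % NUM_GROUPS]++;
	return true;
}

/* Run NUM_VISITS descents. Assumption for the example: even-numbered
 * visits are productive, odd-numbered ones are not.
 * increment_always == true  models the old behaviour,
 * increment_always == false models the patched behaviour. */
static void toy_run(struct toy_switch *sw, bool increment_always)
{
	unsigned visit;

	memset(sw, 0, sizeof(*sw));
	for (visit = 0; visit < NUM_VISITS; visit++) {
		bool reaches_leaf = (visit % 2 == 0);
		bool created = toy_route_down(sw, reaches_leaf);

		if (increment_always || created)
			sw->down_idx++;
	}
}
```

In this toy model, toy_run(&sw, true) leaves the usage counters at 2, 0, 2, 0 (two groups never used, two used twice), while toy_run(&sw, false) leaves them at 1, 1, 1, 1. The exact 1/4, 1/2, 1/4 split reported above depends on the real topology and is not reproduced here.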

Signed-off-by: Nicolas Morey-Chaisemartin <nicolas.morey-chaisemartin at ext.bull.net>
---
   opensm/opensm/osm_ucast_ftree.c |   39 +++++++++++++++++++++++----------------
   1 files changed, 23 insertions(+), 16 deletions(-)


Repost of the patch with Yevgeni's comment and a more complete description :)
Hope it's good this time.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2f1d358f2bdf67838fe8776438b7757d9dcd6e15.diff
Type: text/x-patch
Size: 3806 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090209/276fd8a1/attachment.bin>

