[ofa-general] Re: [PATCH v3] opensm/osm_ucast_ftree.c: Fixed bug on index port incrementation

Sasha Khapyorsky sashak at voltaire.com
Mon Feb 9 11:44:19 PST 2009


On 16:55 Mon 09 Feb     , Nicolas Morey Chaisemartin wrote:
> This patch fixes a bug in port index incrementation in the fat-tree
> algorithm.
> The problem happens (at least) with a 4-level fat tree such as the one below:
>
>
>                          L3  L3
>        ___________________|__|____________________
>       /          /               \               \                <= All the L2 are connected to 2 L3 switches
>    L2-1         L2-2            L2-1           L2-2
>   /             /                 \              \                 <== The Nth L1 of a set leads only to the Nth L2 (L2-N). With some pruning.
>   L1           L1                 L1             L1
>   /|\         /|\                 /|\           /|\
>  ==Fully mixed to L1==          ==Fully mixed to L1==      <=== We have multiple sets. In each set, all L0 lead to all L1 of their set.
>
>    L0           L0                 L0           L0
>  /   \        /    \             /    \       /     \
> CN    CN  .. CN    CN    ....   CN    CN  .. CN    CN
>
>
> To detail:
> We have a bunch of sets. Each set contains compute nodes, L0 and L1
> switches.
> On top of these sits a common layer of L2 and L3 switches.
>
> In each set, there are groups of compute nodes. Each group is connected to 
> a single L0 switch.
> In a given set, all L0 are connected to all L1.
>
> The Nth L1 of a set is connected to the Nth L2 and only to this one (so,
> through an L2, the Nth L1 can only see the Nth L1 of the other sets).
> All the L2 are connected to a couple of L3.
>
>
> If we don't add the L3 switches, we have a perfectly balanced fat tree
> and well-balanced routes.
> But adding the L3 switches makes a huge difference. As they are not
> necessary, no route goes through L3 (which is fine).
> However, 1/4 of the L2->L1 routes are not used at all, 1/2 are half used
> and 1/4 are used twice as much as in the balanced state.
>
> This comes from the down_port_groups_idx, which is incremented each time
> the algorithm goes down through a node, whether or not it creates routes
> to HCAs (port != switch).
> As a route coming up from an L1 reaches only one L2, the algorithm goes
> through all the other L2 while going down, incrementing their index.
> Our case here is a bit specific, but the problem may appear in any case
> where the L1 switches don't have full connectivity to all the L2
> switches and there is another switch rank above.
>
> To avoid this problem, the __osm_ftree_fabric_route_upgoing_by_going_down
> function has been changed to return a value indicating whether routes to
> HCAs (in fact, to leaf switches) were created.
> With this information, we only increase the index when the algorithm has
> created routes to HCAs.
> After applying this patch and measuring the link usage, we are perfectly
> balanced (L2<->L3 links are still not used, but that is to be expected).
>
> Signed-off-by: Nicolas Morey-Chaisemartin 
> <nicolas.morey-chaisemartin at ext.bull.net>
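
The effect described in the quoted text can be sketched with a toy model.
This is not the OpenSM code: toy_switch, toy_route_down and toy_run are
hypothetical names, and the alternating productive/unproductive visits are
an assumed traversal pattern chosen only to make the skew visible. The one
idea taken from the patch description is that the descent reports whether
it created routes, and the round-robin index advances only in that case.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PORTS 4

/* Hypothetical toy switch: counts how often each down port is chosen. */
struct toy_switch {
    unsigned idx;               /* round-robin index (stand-in for down_port_groups_idx) */
    unsigned usage[NUM_PORTS];  /* number of routes placed on each down port */
};

/*
 * Pretend to descend through sw.
 * reaches_target:  nonzero if this descent actually creates routes.
 * only_on_success: nonzero = patched behaviour (advance the index only when
 *                  routes were created); zero = old behaviour (always advance).
 * Returns 1 if routes were created, 0 otherwise.
 */
int toy_route_down(struct toy_switch *sw, int reaches_target, int only_on_success)
{
    int created = 0;

    assert(sw != NULL);
    if (reaches_target) {
        sw->usage[sw->idx % NUM_PORTS]++;
        created = 1;
    }
    if (created || !only_on_success)
        sw->idx++;
    return created;
}

/* Assumed pattern: 16 visits, every second one creates routes. */
void toy_run(struct toy_switch *sw, int only_on_success)
{
    int visit;

    for (visit = 0; visit < 16; visit++)
        toy_route_down(sw, visit % 2 == 0, only_on_success);
}
```

Starting from a zeroed toy_switch, toy_run(&sw, 0) leaves usage at
{4, 0, 4, 0}, while toy_run(&sw, 1) leaves it at {2, 2, 2, 2}: the same
eight routes, spread evenly once unproductive visits stop advancing the
index.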

Applied. Thanks.

Sasha


