[ofa-general] [PATCH] opensm/osm_ucast_ftree.c : Fixed bug on index port order incrementation
Nicolas Morey Chaisemartin
nicolas.morey-chaisemartin at ext.bull.net
Wed Feb 4 02:37:43 PST 2009
Yevgeny Kliteynik wrote:
> Hi Nicolas,
>
> Nicolas Morey Chaisemartin wrote:
>> Hello,
>>
>> While doing some routing analysis on fat tree using ibsim we found a
>> "bug" in the fat-tree algorithm.
>> Problem happens with a 4 level Fat tree as below:
>>
>>
>> L3 L3
>> ___________________|__|____________________
>> / / \ \ <=
>> All the L2 are connected on 2 L3 switches
>> L2-1 L2-2 L2-1 L2-2
>> / / \ \
>> <== The Nth L1 of a set leads only to the Nth L2 (L2-N). With some
>> pruning.
>> L1 L1 L1 L1
>> /|\ /|\ /|\ /|\
>> ==Fully mixed to L1== ==Fully mixed to L1== <=== We
>> have multiple set. In each set, all L0 lead to all L1 of their set.
>>
>> L0 L0 L0 L0
>> / \ / \ / \ / \
>> CN CN .. CN CN .... CN CN .. CN CN
>>
>>
>> To detail:
>> We have a bunch of sets. Each set contains compute node, L0 and L1
>> switches.
>> Plus a common top of L2 and L3 switches.
>>
>> In each set, there are groups of compute nodes. Each group is
>> connected to a single L0 switch.
>> In a given set, all L0 are connected to all L1.
>>
>> The Nth L1 of a set is connected to the Nth L2 and only to this one.
>> (so through a L2, the Nth L1 can only see the Nth L1 of the other sets)
>> All the L2 are connected to a couple of L3.
>>
>>
>> If we dont put the L3. We have a perfectly equilibrated fat tree and
>> well equilibrated routes.
>> But when we add the L3, it introduce a huge difference. As it is not
>> necessary, no route is going through L3 (which is fine).
>> However 1/4 of L2->L1 routes is not used at all, 1/2 is half used and
>> 1/4 is twice overused (compared to the equilibrate state).
>>
>> This comes from the down_port_groups_idx which is incremented each
>> time the algorithm goes down through a node whether it creates routes
>> to HCA (port != switch)
>> or not. As route coming up from a L1 reaches only one L2, the
>> algorithm goes through all the other L2 while going down,
>> incrementing their index.
>> Our case here is a bit specific but in a case where your L1 doesn't
>> have full connectivity to all your L2, and another switch rank above,
>> the problem may appear.
>>
>> To avoid this problem, I've changed the
>> __osm_ftree_fabric_route_upgoing_by_going_down function so it returns
>> a value to indicate if routes to HCA (in fact to leaf switch) were
>> created.
>> With this information, we only increase the index when the algorithm
>> has created routes to HCA.
>> After applying this patch and measuring the link usage, we are at
>> perfect equilibrium (L2<->L3 links are still not used but that is to
>> be expected).
>
> Great! I've actually seen this problem on a real clusters, but
> couldn't understand what's cusing the lack of equilibrity.
>
> See couple of questions below.
>
>> Signed-off-by: Nicolas Morey-Chaisemartin
>> <nicolas.morey-chaisemartin at ext.bull.net>
>> ---
>> opensm/opensm/osm_ucast_ftree.c | 23 ++++++++++++++---------
>> 1 files changed, 14 insertions(+), 9 deletions(-)
>>
>> ------------------------------------------------------------------------
>>
>> diff --git a/opensm/opensm/osm_ucast_ftree.c
>> b/opensm/opensm/osm_ucast_ftree.c
>> index ebe6612..3474876 100644
>> --- a/opensm/opensm/osm_ucast_ftree.c
>> +++ b/opensm/opensm/osm_ucast_ftree.c
>> @@ -1914,7 +1914,7 @@ static void __osm_ftree_set_sw_fwd_table(IN
>> cl_map_item_t * const p_map_item,
>> * assign-up-going-port-by-descending-down to r-port node
>> (recursion)
>> */
>>
>> -static void
>> +static int
>> __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t *
>> p_ftree,
>> IN ftree_sw_t * p_sw,
>> IN ftree_sw_t * p_prev_sw,
>> @@ -1932,21 +1932,23 @@
>> __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t *
>> p_ftree,
>> uint16_t i;
>> uint16_t j;
>> uint16_t k;
>> + uint8_t created_route=0;
>>
>> /* we shouldn't enter here if both real_lid and main_path are
>> false */
>> CL_ASSERT(is_real_lid || is_main_path);
>>
>> /* if there is no down-going ports */
>> if (p_sw->down_port_groups_num == 0)
>> - return;
>> + return 1;
>
> Shouldn't it return 0?
Probably yes. I was thinking to the case where (taking notations from my
scheme above) a L0 wouldn't have any CN (beaucse they are shutdown,
broken, or for future extension). In this case, I think it'll smooth
things a bit and not desequilibrate the network.
>> - /* promote the index that indicates which group should we
>> - start with when going through all the downgoing groups */
>> - p_sw->down_port_groups_idx =
>> - (p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num;
>> + /* If we are on a leaf switch we should be creating routes for
>> real HCA */
>> + /* This flag will be returned so upper layers will incrementent
>> shift index */
>> + if(p_sw->is_leaf == TRUE){
>> + created_route=1;
>> + }
>
> The "is_leaf" flag will be TRUE only on leaf switches that have CNs
> connected to them.
> If we want to solve the problem for all routes (CNs, IO nodes,
> management nodes),
> the "created_route" flag should be updated elsewhere (see below).
I was not aware of this, so yes it should be done somewhere else (though
it can also be done here).
>
>> /* foreach down-going port group (in indexing order) */
>> - i = p_sw->down_port_groups_idx;
>> + i = (p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num;
>> for (k = 0; k < p_sw->down_port_groups_num; k++) {
>
> I think that since p_sw->down_port_groups_idx is promoted below,
> there is no need to increase the starting value of i.
>
I tried to but I had some problem (segfault probably due to the
%p_sw->down_port_groups_num).
If it works without incrementing, it's fine with me.
>> + if(created_route)
>> + p_sw->down_port_groups_idx = +
>> (p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num;
> ...
>> + return created_route;
>> } /* __osm_ftree_fabric_route_upgoing_by_going_down() */
>>
>> /***************************************************/
>
> How about something like this:
>
> @@ -1914,7 +1914,7 @@ static void __osm_ftree_set_sw_fwd_table(IN
> cl_map_item_t * const p_map_item,
> * assign-up-going-port-by-descending-down to r-port node
> (recursion)
> */
>
> -static void
> +static boolean_t
> __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t *
> p_ftree,
> IN ftree_sw_t * p_sw,
> IN ftree_sw_t * p_prev_sw,
> @@ -1932,18 +1932,14 @@
> __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t *
> p_ftree,
> uint16_t i;
> uint16_t j;
> uint16_t k;
> + boolean_t created_route = FALSE;
>
> /* we shouldn't enter here if both real_lid and main_path are
> false */
> CL_ASSERT(is_real_lid || is_main_path);
>
> /* if there is no down-going ports */
> if (p_sw->down_port_groups_num == 0)
> - return;
> -
> - /* promote the index that indicates which group should we
> - start with when going through all the downgoing groups */
> - p_sw->down_port_groups_idx =
> - (p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num;
> + return FALSE;
>
> /* foreach down-going port group (in indexing order) */
> i = p_sw->down_port_groups_idx;
> @@ -1952,9 +1948,12 @@
> __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t *
> p_ftree,
> p_group = p_sw->down_port_groups[i];
> i = (i + 1) % p_sw->down_port_groups_num;
>
> - /* Skip this port group unless it points to a switch */
> - if (p_group->remote_node_type != IB_NODE_TYPE_SWITCH)
> + /* If this port group doesn't point to a switch, mark
> + that the route was created and skip to the next group */
> + if (p_group->remote_node_type != IB_NODE_TYPE_SWITCH) {
> + created_route = TRUE;
> continue;
> + }
>
> if (p_prev_sw
> && (p_group->remote_base_lid == p_prev_sw->base_lid)) {
> @@ -2073,16 +2072,25 @@
> __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t *
> p_ftree,
>
> /* Recursion step:
> Assign upgoing ports by stepping down, starting on REMOTE
> switch */
> - __osm_ftree_fabric_route_upgoing_by_going_down(p_ftree,
> p_remote_sw, /* remote switch - used as a route-upgoing alg. start
> point */
> - NULL, /* prev. position - NULL
> to mark that we went down and not up */
> - target_lid, /* LID that we're
> routing to */
> - target_rank, /* rank of the LID
> that we're routing to */
> - is_real_lid, /* whether the
> target LID is real or dummy */
> - is_main_path, /* whether this
> is path to HCA that should by tracked by counters */
> - highest_rank_in_route); /*
> highest visited point in the tree before going down */
> + created_route |=
> __osm_ftree_fabric_route_upgoing_by_going_down(p_ftree,
> + p_remote_sw, /* remote switch - used as a
> route-upgoing alg. start point */
> + NULL, /* prev. position - NULL to mark that we
> went down and not up */
> + target_lid, /* LID that we're routing to */
> + target_rank, /* rank of the LID that we're routing to */
> + is_real_lid, /* whether the target LID is real or
> dummy */
> + is_main_path, /* whether this is path to HCA that
> should by tracked by counters */
> + highest_rank_in_route); /* highest visited point in
> the tree before going down */
> }
> /* done scanning all the down-going port groups */
>
> + /* if the route was created, promote the index that
> + indicates which group should we start with when
> + going through all the downgoing groups */
> + if (created_route)
> + p_sw->down_port_groups_idx =
> + (p_sw->down_port_groups_idx + 1) %
> p_sw->down_port_groups_num;
> +
> + return created_route;
> } /* __osm_ftree_fabric_route_upgoing_by_going_down() */
>
> /***************************************************/
>
>
>
>
>
>
That seems good.
I'm going to think a bit more about the case where there are no downports.
Best regards
Nicolas
More information about the general
mailing list