[ofa-general] [PATCH] opensm/osm_ucast_ftree.c : Fixed bug on index port order incrementation

Nicolas Morey Chaisemartin nicolas.morey-chaisemartin at ext.bull.net
Wed Feb 4 02:37:43 PST 2009


Yevgeny Kliteynik wrote:
> Hi Nicolas,
>
> Nicolas Morey Chaisemartin wrote:
>> Hello,
>>
>> While doing some routing analysis on fat tree using ibsim we found a 
>> "bug" in the fat-tree algorithm.
>> Problem happens with a 4 level Fat tree as below:
>>
>>
>>                          L3  L3
>>        ___________________|__|____________________
>>       /          /               \               \                <= 
>> All the L2 are connected on 2 L3 switches
>>    L2-1         L2-2            L2-1           L2-2
>>   /             /                 \              \                 
>> <== The Nth L1  of a set leads only to the Nth L2 (L2-N). With some 
>> pruning.
>>   L1           L1                 L1             L1
>>   /|\         /|\                 /|\           /|\
>>  ==Fully mixed to L1==          ==Fully mixed to L1==      <=== We 
>> have multiple set. In each set, all L0 lead to all L1 of their set.
>>
>>    L0           L0                 L0           L0
>>  /   \        /    \             /    \       /     \
>> CN    CN  .. CN    CN    ....   CN    CN  .. CN    CN
>>
>>
>> To detail:
>> We have a bunch of sets. Each set contains compute node, L0 and L1 
>> switches.
>> Plus a common top of L2 and L3 switches.
>>
>> In each set, there are groups of compute nodes. Each group is 
>> connected to a single L0 switch.
>> In a given set, all L0 are connected to all L1.
>>
>> The Nth L1 of a set is connected to the Nth L2 and only to this one. 
>> (so through a L2, the Nth L1 can only see the Nth L1 of the other sets)
>> All the L2 are connected to a couple of L3.
>>
>>
>> If we dont put the L3. We have a perfectly equilibrated fat tree and 
>> well equilibrated routes.
>> But when we add the L3, it introduce a huge difference. As it is not 
>> necessary, no route is going through L3 (which is fine).
>> However 1/4 of L2->L1 routes is not used at all, 1/2 is half used and 
>> 1/4 is twice overused (compared to the equilibrate state).
>>
>> This comes from the down_port_groups_idx which is incremented each 
>> time the algorithm goes down through a node whether it creates routes 
>> to HCA (port != switch)
>> or not. As route coming up from a L1 reaches only one L2, the 
>> algorithm goes through all the other L2 while going down, 
>> incrementing their index.
>> Our case here is a bit specific but in a case where your L1 doesn't 
>> have full connectivity to all your L2, and another switch rank above, 
>> the problem may appear.
>>
>> To avoid this problem, I've changed the 
>> __osm_ftree_fabric_route_upgoing_by_going_down function so it returns 
>> a value to indicate if routes to HCA (in fact to leaf switch) were 
>> created.
>> With this information, we only increase the index when the algorithm 
>> has created routes to HCA.
>> After applying this patch and measuring the link usage, we are at 
>> perfect equilibrium (L2<->L3 links are still not used but that is to 
>> be expected).
>
> Great! I've actually seen this problem on a real clusters, but
> couldn't understand what's cusing the lack of equilibrity.
>
> See couple of questions below.
>
>> Signed-off-by: Nicolas Morey-Chaisemartin 
>> <nicolas.morey-chaisemartin at ext.bull.net>
>> ---
>>  opensm/opensm/osm_ucast_ftree.c |   23 ++++++++++++++---------
>>  1 files changed, 14 insertions(+), 9 deletions(-)
>>
>> ------------------------------------------------------------------------
>>
>> diff --git a/opensm/opensm/osm_ucast_ftree.c 
>> b/opensm/opensm/osm_ucast_ftree.c
>> index ebe6612..3474876 100644
>> --- a/opensm/opensm/osm_ucast_ftree.c
>> +++ b/opensm/opensm/osm_ucast_ftree.c
>> @@ -1914,7 +1914,7 @@ static void __osm_ftree_set_sw_fwd_table(IN 
>> cl_map_item_t * const p_map_item,
>>   *        assign-up-going-port-by-descending-down to r-port node 
>> (recursion)
>>   */
>>  
>> -static void
>> +static int
>>  __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * 
>> p_ftree,
>>                             IN ftree_sw_t * p_sw,
>>                             IN ftree_sw_t * p_prev_sw,
>> @@ -1932,21 +1932,23 @@ 
>> __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * 
>> p_ftree,
>>      uint16_t i;
>>      uint16_t j;
>>      uint16_t k;
>> +    uint8_t created_route=0;
>>  
>>      /* we shouldn't enter here if both real_lid and main_path are 
>> false */
>>      CL_ASSERT(is_real_lid || is_main_path);
>>  
>>      /* if there is no down-going ports */
>>      if (p_sw->down_port_groups_num == 0)
>> -        return;
>> +        return 1;
>
> Shouldn't it return 0?
Probably yes. I was thinking to the case where (taking notations from my 
scheme above) a L0 wouldn't have any CN (beaucse they are shutdown, 
broken, or for future extension). In this case, I think it'll smooth 
things a bit and not desequilibrate the network.
>> -    /* promote the index that indicates which group should we
>> -       start with when going through all the downgoing groups */
>> -    p_sw->down_port_groups_idx =
>> -        (p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num;
>> +    /* If we are on a leaf switch we should be creating routes for 
>> real HCA  */
>> +    /* This flag will be returned  so upper layers will incrementent 
>> shift index */
>> +    if(p_sw->is_leaf == TRUE){
>> +        created_route=1;
>> +    }
>
> The "is_leaf" flag will be TRUE only on leaf switches that have CNs 
> connected to them.
> If we want to solve the problem for all routes (CNs, IO nodes, 
> management nodes),
> the "created_route" flag should be updated elsewhere (see below).
I was not aware of this, so yes it should be done somewhere else (though 
it can also be done here).
>
>>      /* foreach down-going port group (in indexing order) */
>> -    i = p_sw->down_port_groups_idx;
>> +    i = (p_sw->down_port_groups_idx + 1) %  p_sw->down_port_groups_num;
>>      for (k = 0; k < p_sw->down_port_groups_num; k++) {
>
> I think that since p_sw->down_port_groups_idx is promoted below,
> there is no need to increase the starting value of i.
>
I tried to but I had some problem (segfault probably due to the 
%p_sw->down_port_groups_num).
If it works without incrementing, it's fine with me.
>> +    if(created_route)
>> +        p_sw->down_port_groups_idx = +            
>> (p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num;
> ...
>> +    return created_route;
>>  }                /* __osm_ftree_fabric_route_upgoing_by_going_down() */
>>  
>>  /***************************************************/
>
> How about something like this:
>
> @@ -1914,7 +1914,7 @@ static void __osm_ftree_set_sw_fwd_table(IN 
> cl_map_item_t * const p_map_item,
>   *        assign-up-going-port-by-descending-down to r-port node 
> (recursion)
>   */
>
> -static void
> +static boolean_t
>  __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * 
> p_ftree,
>                             IN ftree_sw_t * p_sw,
>                             IN ftree_sw_t * p_prev_sw,
> @@ -1932,18 +1932,14 @@ 
> __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * 
> p_ftree,
>      uint16_t i;
>      uint16_t j;
>      uint16_t k;
> +    boolean_t created_route = FALSE;
>
>      /* we shouldn't enter here if both real_lid and main_path are 
> false */
>      CL_ASSERT(is_real_lid || is_main_path);
>
>      /* if there is no down-going ports */
>      if (p_sw->down_port_groups_num == 0)
> -        return;
> -
> -    /* promote the index that indicates which group should we
> -       start with when going through all the downgoing groups */
> -    p_sw->down_port_groups_idx =
> -        (p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num;
> +        return FALSE;
>
>      /* foreach down-going port group (in indexing order) */
>      i = p_sw->down_port_groups_idx;
> @@ -1952,9 +1948,12 @@ 
> __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * 
> p_ftree,
>          p_group = p_sw->down_port_groups[i];
>          i = (i + 1) % p_sw->down_port_groups_num;
>
> -        /* Skip this port group unless it points to a switch */
> -        if (p_group->remote_node_type != IB_NODE_TYPE_SWITCH)
> +        /* If this port group doesn't point to a switch, mark
> +           that the route was created and skip to the next group */
> +        if (p_group->remote_node_type != IB_NODE_TYPE_SWITCH) {
> +            created_route = TRUE;
>              continue;
> +        }
>
>          if (p_prev_sw
>              && (p_group->remote_base_lid == p_prev_sw->base_lid)) {
> @@ -2073,16 +2072,25 @@ 
> __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * 
> p_ftree,
>
>          /* Recursion step:
>             Assign upgoing ports by stepping down, starting on REMOTE 
> switch */
> -        __osm_ftree_fabric_route_upgoing_by_going_down(p_ftree, 
> p_remote_sw,    /* remote switch - used as a route-upgoing alg. start 
> point */
> -                                   NULL,    /* prev. position - NULL 
> to mark that we went down and not up */
> -                                   target_lid,    /* LID that we're 
> routing to */
> -                                   target_rank,    /* rank of the LID 
> that we're routing to */
> -                                   is_real_lid,    /* whether the 
> target LID is real or dummy */
> -                                   is_main_path,    /* whether this 
> is path to HCA that should by tracked by counters */
> -                                   highest_rank_in_route);    /* 
> highest visited point in the tree before going down */
> +        created_route |= 
> __osm_ftree_fabric_route_upgoing_by_going_down(p_ftree,
> +            p_remote_sw,    /* remote switch - used as a 
> route-upgoing alg. start point */
> +            NULL,        /* prev. position - NULL to mark that we 
> went down and not up */
> +            target_lid,    /* LID that we're routing to */
> +            target_rank,    /* rank of the LID that we're routing to */
> +            is_real_lid,    /* whether the target LID is real or 
> dummy */
> +            is_main_path,    /* whether this is path to HCA that 
> should by tracked by counters */
> +            highest_rank_in_route);    /* highest visited point in 
> the tree before going down */
>      }
>      /* done scanning all the down-going port groups */
>
> +    /* if the route was created, promote the index that
> +       indicates which group should we start with when
> +       going through all the downgoing groups */
> +    if (created_route)
> +        p_sw->down_port_groups_idx =
> +            (p_sw->down_port_groups_idx + 1) % 
> p_sw->down_port_groups_num;
> +
> +    return created_route;
>  }                /* __osm_ftree_fabric_route_upgoing_by_going_down() */
>
>  /***************************************************/
>
>
>
>
>
>

That seems good.
I'm going to think a bit more about the case where there are no downports.

Best regards

Nicolas




More information about the general mailing list