[ofa-general] [PATCH] opensm/osm_ucast_ftree.c : Fixed bug on index port order incrementation

Yevgeny Kliteynik kliteyn at dev.mellanox.co.il
Wed Feb 4 02:19:08 PST 2009


Hi Nicolas,

Nicolas Morey Chaisemartin wrote:
> Hello,
> 
> While doing some routing analysis on fat tree using ibsim we found a 
> "bug" in the fat-tree algorithm.
> Problem happens with a 4 level Fat tree as below:
> 
> 
>                          L3  L3
>        ___________________|__|____________________
>       /          /               \               \                <= All 
> the L2 are connected on 2 L3 switches
>    L2-1         L2-2            L2-1           L2-2
>   /             /                 \              \                 <== 
> The Nth L1  of a set leads only to the Nth L2 (L2-N). With some pruning.
>   L1           L1                 L1             L1
>   /|\         /|\                 /|\           /|\
>  ==Fully mixed to L1==          ==Fully mixed to L1==      <=== We have 
> multiple set. In each set, all L0 lead to all L1 of their set.
> 
>    L0           L0                 L0           L0
>  /   \        /    \             /    \       /     \
> CN    CN  .. CN    CN    ....   CN    CN  .. CN    CN
> 
> 
> To detail:
> We have a bunch of sets. Each set contains compute node, L0 and L1 
> switches.
> Plus a common top of L2 and L3 switches.
> 
> In each set, there are groups of compute nodes. Each group is connected 
> to a single L0 switch.
> In a given set, all L0 are connected to all L1.
> 
> The Nth L1 of a set is connected to the Nth L2 and only to this one. (so 
> through a L2, the Nth L1 can only see the Nth L1 of the other sets)
> All the L2 are connected to a couple of L3.
> 
> 
> If we dont put the L3. We have a perfectly equilibrated fat tree and 
> well equilibrated routes.
> But when we add the L3, it introduce a huge difference. As it is not 
> necessary, no route is going through L3 (which is fine).
> However 1/4 of L2->L1 routes is not used at all, 1/2 is half used and 
> 1/4 is twice overused (compared to the equilibrate state).
> 
> This comes from the down_port_groups_idx which is incremented each time 
> the algorithm goes down through a node whether it creates routes to HCA 
> (port != switch)
> or not. As route coming up from a L1 reaches only one L2, the algorithm 
> goes through all the other L2 while going down, incrementing their index.
> Our case here is a bit specific but in a case where your L1 doesn't have 
> full connectivity to all your L2, and another switch rank above, the 
> problem may appear.
> 
> To avoid this problem, I've changed the 
> __osm_ftree_fabric_route_upgoing_by_going_down function so it returns a 
> value to indicate if routes to HCA (in fact to leaf switch) were created.
> With this information, we only increase the index when the algorithm has 
> created routes to HCA.
> After applying this patch and measuring the link usage, we are at 
> perfect equilibrium (L2<->L3 links are still not used but that is to be 
> expected).

Great! I've actually seen this problem on a real clusters, but
couldn't understand what's cusing the lack of equilibrity.

See couple of questions below.

> Signed-off-by: Nicolas Morey-Chaisemartin 
> <nicolas.morey-chaisemartin at ext.bull.net>
> ---
>  opensm/opensm/osm_ucast_ftree.c |   23 ++++++++++++++---------
>  1 files changed, 14 insertions(+), 9 deletions(-)
> 
> ------------------------------------------------------------------------
> 
> diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c
> index ebe6612..3474876 100644
> --- a/opensm/opensm/osm_ucast_ftree.c
> +++ b/opensm/opensm/osm_ucast_ftree.c
> @@ -1914,7 +1914,7 @@ static void __osm_ftree_set_sw_fwd_table(IN cl_map_item_t * const p_map_item,
>   *        assign-up-going-port-by-descending-down to r-port node (recursion)
>   */
>  
> -static void
> +static int
>  __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
>  					       IN ftree_sw_t * p_sw,
>  					       IN ftree_sw_t * p_prev_sw,
> @@ -1932,21 +1932,23 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
>  	uint16_t i;
>  	uint16_t j;
>  	uint16_t k;
> +	uint8_t created_route=0;
>  
>  	/* we shouldn't enter here if both real_lid and main_path are false */
>  	CL_ASSERT(is_real_lid || is_main_path);
>  
>  	/* if there is no down-going ports */
>  	if (p_sw->down_port_groups_num == 0)
> -		return;
> +		return 1;

Shouldn't it return 0?

> -	/* promote the index that indicates which group should we
> -	   start with when going through all the downgoing groups */
> -	p_sw->down_port_groups_idx =
> -		(p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num;
> +	/* If we are on a leaf switch we should be creating routes for real HCA  */
> +	/* This flag will be returned  so upper layers will incrementent shift index */
> +	if(p_sw->is_leaf == TRUE){
> +		created_route=1;
> +	}

The "is_leaf" flag will be TRUE only on leaf switches that have CNs connected to them.
If we want to solve the problem for all routes (CNs, IO nodes, management nodes),
the "created_route" flag should be updated elsewhere (see below).

>  	/* foreach down-going port group (in indexing order) */
> -	i = p_sw->down_port_groups_idx;
> +	i = (p_sw->down_port_groups_idx + 1) %  p_sw->down_port_groups_num;
>  	for (k = 0; k < p_sw->down_port_groups_num; k++) {

I think that since p_sw->down_port_groups_idx is promoted below,
there is no need to increase the starting value of i.

> +	if(created_route)
> +		p_sw->down_port_groups_idx = 
> +			(p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num;
...
> +	return created_route;
>  }				/* __osm_ftree_fabric_route_upgoing_by_going_down() */
>  
>  /***************************************************/

How about something like this:

@@ -1914,7 +1914,7 @@ static void __osm_ftree_set_sw_fwd_table(IN cl_map_item_t * const p_map_item,
   *        assign-up-going-port-by-descending-down to r-port node (recursion)
   */

-static void
+static boolean_t
  __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
  					       IN ftree_sw_t * p_sw,
  					       IN ftree_sw_t * p_prev_sw,
@@ -1932,18 +1932,14 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
  	uint16_t i;
  	uint16_t j;
  	uint16_t k;
+	boolean_t created_route = FALSE;

  	/* we shouldn't enter here if both real_lid and main_path are false */
  	CL_ASSERT(is_real_lid || is_main_path);

  	/* if there is no down-going ports */
  	if (p_sw->down_port_groups_num == 0)
-		return;
-
-	/* promote the index that indicates which group should we
-	   start with when going through all the downgoing groups */
-	p_sw->down_port_groups_idx =
-		(p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num;
+		return FALSE;

  	/* foreach down-going port group (in indexing order) */
  	i = p_sw->down_port_groups_idx;
@@ -1952,9 +1948,12 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
  		p_group = p_sw->down_port_groups[i];
  		i = (i + 1) % p_sw->down_port_groups_num;

-		/* Skip this port group unless it points to a switch */
-		if (p_group->remote_node_type != IB_NODE_TYPE_SWITCH)
+		/* If this port group doesn't point to a switch, mark
+		   that the route was created and skip to the next group */
+		if (p_group->remote_node_type != IB_NODE_TYPE_SWITCH) {
+			created_route = TRUE;
  			continue;
+		}

  		if (p_prev_sw
  		    && (p_group->remote_base_lid == p_prev_sw->base_lid)) {
@@ -2073,16 +2072,25 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,

  		/* Recursion step:
  		   Assign upgoing ports by stepping down, starting on REMOTE switch */
-		__osm_ftree_fabric_route_upgoing_by_going_down(p_ftree, p_remote_sw,	/* remote switch - used as a route-upgoing alg. start point */
-							       NULL,	/* prev. position - NULL to mark that we went down and not up */
-							       target_lid,	/* LID that we're routing to */
-							       target_rank,	/* rank of the LID that we're routing to */
-							       is_real_lid,	/* whether the target LID is real or dummy */
-							       is_main_path,	/* whether this is path to HCA that should by tracked by counters */
-							       highest_rank_in_route);	/* highest visited point in the tree before going down */
+		created_route |= __osm_ftree_fabric_route_upgoing_by_going_down(p_ftree,
+			p_remote_sw,	/* remote switch - used as a route-upgoing alg. start point */
+			NULL,		/* prev. position - NULL to mark that we went down and not up */
+			target_lid,	/* LID that we're routing to */
+			target_rank,	/* rank of the LID that we're routing to */
+			is_real_lid,	/* whether the target LID is real or dummy */
+			is_main_path,	/* whether this is path to HCA that should by tracked by counters */
+			highest_rank_in_route);	/* highest visited point in the tree before going down */
  	}
  	/* done scanning all the down-going port groups */

+	/* if the route was created, promote the index that
+	   indicates which group should we start with when
+	   going through all the downgoing groups */
+	if (created_route)
+		p_sw->down_port_groups_idx =
+			(p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num;
+
+	return created_route;
  }				/* __osm_ftree_fabric_route_upgoing_by_going_down() */

  /***************************************************/







More information about the general mailing list