[openib-general] Re: OpenSM Routing Algorithms Scalability and Enhancements

Mon Sep 26 08:52:23 PDT 2005

Hi Eitan,

I finally got a chance to read this over. Here are some comments:

On Tue, 2005-09-06 at 15:18, Eitan Zahavi wrote: 
> Hi All,
> 
> As we are about to start working on the fast routing algorithms,
> here is the writeup about proposed algorithms for your review.

This appears to be an update to what was publshed last May.

> The plan is to start development once the merge of 1.8.0 into the
> trunk is done.

> 
> ______________________________________________________________________
>               OpenSM Unicast Routing Enhancements for Scalability
>              =====================================================
> 
> Authors:Eitan Zahavi <eitan at mellanox.co.il>, Yael Kalka <yael at mellanox.co.il>
> Date:   Aug 2005.
> 
> Table of contents:
> 1. Overview
> 2. Notation
> 3. Current Algorithms
> 4. Proposed Routing Algorithms
> 5. Min Hop Tables Implementation
> 6. Incremental Routing
> 7. Routing Persistancy
> 
> 1. Overview:
> --
> 
> OpenSM currently uses a two stage routing algorithm for unicast
> forwarding tables calculation. As shown later these algorithm are
> O(N^3). Inspected run time of OpenSM routing stage was ~2.5min on 1100
> nodes cluster.

On what processor architecture (and what speed was the CPU) ? Also
how much memory was in that machine (memory consumption of OpenSM) ?

>  The purpose of this memo is to present the community
> the proposed work for enhancing OpenSM routing engine.
> 
> 2. Notation:
> --
> The following notations are used throughout this document:
> S = Number of switch devices in the system
> P = Number of ports each switch node has
> H = Number of HCA ports connected to the fabric
> L = Number of HCAs connected to each Leaf switch device. 
>     Normal values are 1/2P to 3/4P

Is L the proportion of leaf switch ports connected to HCAs rather than
to other switches ? That's what it looks like from 1/2P -> 3/4P.

> D = Fat Tree depth
> 
> 3. Current Algorithms:
> --
> OpenSM provide two routing algorithms: Minimal Hop and Up/Down. Both
> of them do not scale with cluster size and can consume both large run
> time (minutes) and memory (GB). This section provides meta code for
> these algorithms and order calculation.
> 
> 3.1 Min Hop algorithm analysis:
> The Min Hop algorithm is divided into two stages: computation of
> min-hop tables on every switch and LFT output port assignment. 
> 
> Step 1: Computation of Min-Hop Tables on each switch
> 
> The memory consumed is S*(S+H)*(P+2)*Byte. On 10K nodes cluster with
> 2500 switch devices this ends up as 812M-Byte (using LMC=0).
> 
> Meta algorithm:
> For each HCA mark its remote neighbor switch port with hop 1.
> For each switch mark itself port 0 as hop 0
> While changes 
>   For each switch  
>      For each port 
>        For each LID 
>          Propagate remote port hop as hop +1 if smaller or undefined
> 
> The order of this step: O(S*P*(S+H)*(D+1))
> 
> Step 2: Assigning output port:
> For each switch 
>   For each LID 
>      For each Port 
>         Is it the one with min count
> 
> The order of this step: O(S*(S+H)*P)
> 
> 3.2 Up/Down algorithm analysis
> The Up/Down algorithm depends on the ability to rank the fabric nodes
> from root to leaf of the tree. To get that ranking it runs a
> heuristics that is based on the Min Hop tables. So the memory and 
> complexity are identical to the Min-Hop first step to start with.
>  
> Once ranking is performed the algorithm is BFS from every HCA and fill
> in the Min Hop tables again. Up/Down traversal rule is enforced during
> the BFS such that only valid turns are allowed.
> 
> Meta algorithm:
> For each HCA 
>   Get connected Switch 
>      For each Switch in NextSwitches 
>         For each Port 
>            Check if direction is OK. Check if not visited
> 
> The order is O(H*S*P)
> 
> To finalize output port assignment the second step of the Min Hop
> algorithm is invoked. 
> 
> 
> 4. Proposed Algorithm:
> --
> 
> Inspecting the routing problem we have noticed the following
> attributes:
> a. Using Min-Hop tables for keeping intermediate routing information
>    has a disadvantage in terms of memory consumption. However, any
>    incremental routing algorithm (for handling fabric changes after
>    first setup), or routing persistence solution could use this
>    information and gain speed.
> b. Since we need to fill in LFT tables that are of the order S*(S+H)
>    the algorithm is lower bounded by O(H^2).

Is the lower bound O(S^2) ?

> c. A persistence based solution which uses previously routed fabric
>    data and is able to handle simple incremental changes will provide
>    a much faster runtime as topology match will require O(S*P)
>    (traversing all links once)
> d. Since the minimum hops information is identical for a switch and
>    all the HCAs connected to it - there is no point in building "min
>    hop" tables for HCAs. During the "output port" assignment stage,
>    the HCAs connected to each switch are considered and routed.

Yes, but the actual min hops value is different for the HCA and switch
but I don't think that matters. It is the output port for min hops that
matters here..

I presume wherever HCA is in this would also apply to router ports.

Also, what about switch port 0 routing ?

> The result of "a" is that several algorithms that are superior from
> memory footprint and skip any "hop table" stage are not considered for
> implementation.
> 
> To support "d" we needed to provide an index to each switch such that
> the "min hop" tables are dense (previously they were indexed by LID).
> The new index is stored on the switch object and thus allow lookup of
> a switch "min hop" by its index. An array of switch pointer by index
> supports the reverse lookup.

Seems like array matching is might be inefficient in a large switch
network. I'm sure this is in more places in OpenSM. Should these be
replaced by some other data structure and search ? [This is a general
OpenSM question but comes up in this context.]

> The suggested algorithm is broken into the following 3 stages:
> * Root nodes identification heuristics
> * Min Hop tables computation
> * Output port assignment
> 
> 4.1 Root nodes identification heuristics:
> This step is only required under the AND of the following two
> conditions: 
> * Up/Down routing is required
> * The user does not provide a file with guids of the tree "root nodes" 

What is the impact of poor root node selection/choice ? How many root
nodes are there ?
> This heuristics for recognizing the tree roots is based on histograms
> of the HCAs distance from every switch.
> I.e. How many HCAs are 1 hop, 2 hops from the switch. In order to fill
> in these histograms on all switches we need to BFS from every leaf
> switch and propagate the number of HCAs connected to it:
> 
> Meta algorithm:
> For each switch 
>   For each Port 
>     If connected to HCAs count them 
>     If any HCA 
>      Init BFS to start with current switch 
>      set hop count to 0 
>      While there are switches in BFS list 
>        increment hop count 
>        For each switch in BFS List 
>          Add the number of HCAs to the histogram at the current hop count 
>         For each port 
>          If remote port switch not visited 
>           Add the switch to the BFS Next Step List 
>       Once finished all this step list use next steps

I didn't follow the meta algorithm but I'm not sure it matters right now.
> The order of this step is: O(S*P + H/L*S*P) = O(*H*S)

                                                  ^^^^^?
This looks to me like O((1+H/L)*S*P). Not sure how that reduces to
O(*H*S) but it looks like something might be missing there.

> 4.2 Min Hop tables computation:
> This step is mandatory and has a slightly different flavor for the
> case of Up/Down routing.
> 
> The algorithm starts from every leaf switch and traverses BFS wise
> through the fabric. 
> 
> Meta algorithm:
> foreach switch in the fabric
>  clear the "Rank" vector for all switches.
>  start BFS with the given switch.
>  set rank to 0
>  while any switch in BFS list
>  | foreach switch in BFS switch list
>  | |foreach port (valid, active, not unhealthy)
>  | | if remote side is a switch: 
>  | |  if rank of remote side 0 or = rank + 1
>  | |    set the remote port entry MinHopPort for this switch
>  | |  if rank of remote side 0 i.e. never visited
>  | |    set the remote switch rank to rank + 1
>  | |    add the remote switch to next BFS switches
>  | |--
>  | switch between the current and next switches list
>  | increment rank
>  |--
> 
> The order of this algorithm: O(P*S^2)
> 
> Algorithm that merges Up/Down step criteria not yet written for this
> stage. But the idea is to make each step keep track of any previous
> step down. Such that a step up will be prohibited in this case.
> 
> 4.3 Output port assignment:
> This step provide actual LFT values assignment to each switch.
> To do that we access the built "min hop" tables and track port usage.
> 
> Meta algorithm:
> foreach switch in the fabric
>  clear the port subscription vector (track number of paths subscribed)
>  foreach target switch in the "min hop" table
>   get the list of min-hop ports
>   foreach end-node attached (HCA connected to it and itself)
>    if lmc > 0 init tracking of remote system and node for selecting
>       disjoint paths for same end-node different LID LSBs

What are the policies for disjoint paths ? I can envision several.
>    get min-subscribed (and disjoint) port marked min-hop target switch
>    track port usage in port-subscription (opt. if target LID is not a switch)
> 
> Order of this step: 
> Currently the selection of the output port by min-subscription is
> trivial and requires O(P) so the overall order is 
> O(S*S*(1+L)*P) <= O(16*P*S^2)
> 
> One could obtain the list of marked min-hop ports and then use a
> modified cyclic list for avoiding the search for min subscription in
> the case of LMC > 0. In that case the order could be reduced to:
> O(S*S*(1+L)) ~= O(S*H)
> 
> 5. Min Hop Table Implementation:
> --
> The proposed algorithm does not require storing the number of hops
> arriving at the switch port - but only the fact a port is on the min
> hop path. This allowed for another memory usage reduction if the min
> hop table would be of boolean values. 
> 
> The issue then is in an efficiant iterator on the boolean (bits)
> array. The tradeoff is thus the common memory versus runtime.
> 
> (Anybody knowns off a fast boolean array lookup implementation ?)
> 
> 6. Incremental Routing:
> --
> Once the fabric is routed we can define an algorithm for performing
> incremental routing changes. An obvious case is when a link is
> declared un-healthy or one of the ports is dropped. Assuming the
> recognition of the change is done by some other algorithm. The
> following cases apply: 
> *	If the link connects HCA and a switch the HCA is unreachable. No
>    routing change required. 
> *	If the link is between switches:
>    * If there at least one another link between these switches:  
>      o Spread all routes going through the failing port to the other
>        ports connecting to the same switch.

Right, but LMC needs to be "honored".
>    * If there is no other link to these switches
>      o Go back to all switches that feed into each one of the switches
>        (feed in means they route some target lids through the switch)
>        but only those that route lids that go through the failing port. Check
>        to see if there is another port that goes to a different switch
>        to route that lid to. If there is no other way do nothing.

When multipathing is no longer multipath, it would be nice to inform NM.
> How do we support topology changes line moving an HCA from one Switch
> to another? 

Also, what about addition of new switches and HCAs ?

What about subnet merge ?

> 7. Routing Persistancy:
> --
> To make the subnet initialization faster, one could store the existing
> routing solution and use it without any calculation. 
> 
> The issue is of course what conditions makes the stored routing
> obsolete. To maximize the usefullnes of the stored information we
> propose to store the Min Hop tables rather then the final port
> assignment. It is assumed that after restart there might be a need for
> modifications to LMC and routing which will invalidate the LFT
> anyways. To enable "cache invalidation criteria" the persistent
> database should include information that could be used to easily check
> if the fabric was not altered in a way that invalidates the MinHop tables. 
> The stored information should hold for each switch in the fabric (by guid)
> the list of ports and the guids and port numbers on the remote side. 
> To validate there are no significant changes, the discovered set
> of switches is checked to match the stored information.

Right, it's only the switches that matter (and their connectivity: that
links between switches are active )and healthy)).

A related implementation issue is the format of the stored information.

>  Table 1
> describes the possible changes and their effect on the  validity of
> the MinHop tables. 
> 
> Table 1 - Connectivity Changes effect on Routing Info Validity
> 
> Change           | Effect on MinHop Tables	 | Effect on LFT and MFT   |
> --
> New Switch found | Invalidates (might connect | Invalidates             |
>                  | more HCAs and carries more |                         |
>                  | routing resources)         |                         |
> --
> Missing Switch   | Invalidates (MinHops might | Invalidates             |
>                  | be broken a few steps away)|                         |
> --
> New cable found  | Does not invalidate        | Does not invalidate     |
> --

Assuming this is interswitch links, that could be ignored (which is what
I think you are saying by does not invalidate) even if the new link(s)
could improve the routing. If this is the case, does not need to
invalidate might be more accurate here.

> Missing Cable    | Invalidates only if there  | Invalidates all LIDs    |
> (SW to SW)       | is no other cable          | going through that port |
>                  | connecting the switches    |                         |
> --

Might affect multipathing.
> Missing Cable    | Does no invalidate         | Does not invalidate     |
> (SW to HCA)      |                            |                         |
> --
> New HCA          | Does no invalidate         | Does not invalidate     |
> --
> Missing HCA      | Does no invalidate         | Does not invalidate     |
> --
> LID Changes      | Does no invalidate         | Invalidates the modified|
>                  |                            | LIDs                    |
> --
> 
> Special marking for "root nodes" shold cache the results of the first
> step for Up/Dpwn routing. These nodes should be invalidated on any
> missing or additional switch conditions.

It appears that (SM) failover is totally independent of all this. Is
that true ?

Possible some more on this later...

-- Hal