OpenSM Unicast Routing Enhancements for Scalability ===================================================== Authors:Eitan Zahavi , Yael Kalka Date: Aug 2005. Table of contents: 1. Overview 2. Notation 3. Current Algorithms 4. Proposed Routing Algorithms 5. Min Hop Tables Implementation 6. Incremental Routing 7. Routing Persistancy 1. Overview: ------------ OpenSM currently uses a two stage routing algorithm for unicast forwarding tables calculation. As shown later these algorithm are O(N^3). Inspected run time of OpenSM routing stage was ~2.5min on 1100 nodes cluster. The purpose of this memo is to present the community the proposed work for enhancing OpenSM routing engine. 2. Notation: ------------ The following notations are used throughout this document: S = Number of switch devices in the system P = Number of ports each switch node has H = Number of HCA ports connected to the fabric L = Number of HCAs connected to each Leaf switch device. Normal values are 1/2P to 3/4P D = Fat Tree depth 3. Current Algorithms: ---------------------- OpenSM provide two routing algorithms: Minimal Hop and Up/Down. Both of them do not scale with cluster size and can consume both large run time (minutes) and memory (GB). This section provides meta code for these algorithms and order calculation. 3.1 Min Hop algorithm analysis: The Min Hop algorithm is divided into two stages: computation of min-hop tables on every switch and LFT output port assignment. Step 1: Computation of Min-Hop Tables on each switch The memory consumed is S*(S+H)*(P+2)*Byte. On 10K nodes cluster with 2500 switch devices this ends up as 812M-Byte (using LMC=0). Meta algorithm: For each HCA mark its remote neighbor switch port with hop 1. For each switch mark itself port 0 as hop 0 While changes For each switch For each port For each LID Propagate remote port hop as hop +1 if smaller or undefined The order of this step: O(S*P*(S+H)*(D+1)) Step 2: Assigning output port: For each switch For each LID For each Port Is it the one with min count The order of this step: O(S*(S+H)*P) 3.2 Up/Down algorithm analysis The Up/Down algorithm depends on the ability to rank the fabric nodes from root to leaf of the tree. To get that ranking it runs a heuristics that is based on the Min Hop tables. So the memory and complexity are identical to the Min-Hop first step to start with. Once ranking is performed the algorithm is BFS from every HCA and fill in the Min Hop tables again. Up/Down traversal rule is enforced during the BFS such that only valid turns are allowed. Meta algorithm: For each HCA Get connected Switch For each Switch in NextSwitches For each Port Check if direction is OK. Check if not visited The order is O(H*S*P) To finalize output port assignment the second step of the Min Hop algorithm is invoked. 4. Proposed Algorithm: ---------------------- Inspecting the routing problem we have noticed the following attributes: a. Using Min-Hop tables for keeping intermediate routing information has a disadvantage in terms of memory consumption. However, any incremental routing algorithm (for handling fabric changes after first setup), or routing persistence solution could use this information and gain speed. b. Since we need to fill in LFT tables that are of the order S*(S+H) the algorithm is lower bounded by O(H^2). c. A persistence based solution which uses previously routed fabric data and is able to handle simple incremental changes will provide a much faster runtime as topology match will require O(S*P) (traversing all links once) d. Since the minimum hops information is identical for a switch and all the HCAs connected to it - there is no point in building "min hop" tables for HCAs. During the "output port" assignment stage, the HCAs connected to each switch are considered and routed. The result of "a" is that several algorithms that are superior from memory footprint and skip any "hop table" stage are not considered for implementation. To support "d" we needed to provide an index to each switch such that the "min hop" tables are dense (previously they were indexed by LID). The new index is stored on the switch object and thus allow lookup of a switch "min hop" by its index. An array of switch pointer by index supports the reverse lookup. The suggested algorithm is broken into the following 3 stages: * Root nodes identification heuristics * Min Hop tables computation * Output port assignment 4.1 Root nodes identification heuristics: This step is only required under the AND of the following two conditions: * Up/Down routing is required * The user does not provide a file with guids of the tree "root nodes" This heuristics for recognizing the tree roots is based on histograms of the HCAs distance from every switch. I.e. How many HCAs are 1 hop, 2 hops from the switch. In order to fill in these histograms on all switches we need to BFS from every leaf switch and propagate the number of HCAs connected to it: Meta algorithm: For each switch For each Port If connected to HCAs count them If any HCA Init BFS to start with current switch set hop count to 0 While there are switches in BFS list increment hop count For each switch in BFS List Add the number of HCAs to the histogram at the current hop count For each port If remote port switch not visited Add the switch to the BFS Next Step List Once finished all this step list use next steps The order of this step is: O(S*P + H/L*S*P) = O(½*H*S) 4.2 Min Hop tables computation: This step is mandatory and has a slightly different flavor for the case of Up/Down routing. The algorithm starts from every leaf switch and traverses BFS wise through the fabric. Meta algorithm: foreach switch in the fabric clear the "Rank" vector for all switches. start BFS with the given switch. set rank to 0 while any switch in BFS list | foreach switch in BFS switch list | |foreach port (valid, active, not unhealthy) | | if remote side is a switch: | | if rank of remote side 0 or = rank + 1 | | set the remote port entry MinHopPort for this switch | | if rank of remote side 0 i.e. never visited | | set the remote switch rank to rank + 1 | | add the remote switch to next BFS switches | |------ | switch between the current and next switches list | increment rank |------ The order of this algorithm: O(P*S^2) Algorithm that merges Up/Down step criteria not yet written for this stage. But the idea is to make each step keep track of any previous step down. Such that a step up will be prohibited in this case. 4.3 Output port assignment: This step provide actual LFT values assignment to each switch. To do that we access the built "min hop" tables and track port usage. Meta algorithm: foreach switch in the fabric clear the port subscription vector (track number of paths subscribed) foreach target switch in the "min hop" table get the list of min-hop ports foreach end-node attached (HCA connected to it and itself) if lmc > 0 init tracking of remote system and node for selecting disjoint paths for same end-node different LID LSBs get min-subscribed (and disjoint) port marked min-hop target switch track port usage in port-subscription (opt. if target LID is not a switch) Order of this step: Currently the selection of the output port by min-subscription is trivial and requires O(P) so the overall order is O(S*S*(1+L)*P) <= O(16*P*S^2) One could obtain the list of marked min-hop ports and then use a modified cyclic list for avoiding the search for min subscription in the case of LMC > 0. In that case the order could be reduced to: O(S*S*(1+L)) ~= O(S*H) 5. Min Hop Table Implementation: -------------------------------- The proposed algorithm does not require storing the number of hops arriving at the switch port - but only the fact a port is on the min hop path. This allowed for another memory usage reduction if the min hop table would be of boolean values. The issue then is in an efficiant iterator on the boolean (bits) array. The tradeoff is thus the common memory versus runtime. (Anybody knowns off a fast boolean array lookup implementation ?) 6. Incremental Routing: ----------------------- Once the fabric is routed we can define an algorithm for performing incremental routing changes. An obvious case is when a link is declared un-healthy or one of the ports is dropped. Assuming the recognition of the change is done by some other algorithm. The following cases apply: * If the link connects HCA and a switch the HCA is unreachable. No routing change required. * If the link is between switches: * If there at least one another link between these switches: o Spread all routes going through the failing port to the other ports connecting to the same switch. * If there is no other link to these switches o Go back to all switches that feed into each one of the switches (feed in means they route some target lids through the switch) but only those that route lids that go through the failing port. Check to see if there is another port that goes to a different switch to route that lid to. If there is no other way do nothing. How do we support topology changes line moving an HCA from one Switch to another? 7. Routing Persistancy: ----------------------- To make the subnet initialization faster, one could store the existing routing solution and use it without any calculation. The issue is of course what conditions makes the stored routing obsolete. To maximize the usefullnes of the stored information we propose to store the Min Hop tables rather then the final port assignment. It is assumed that after restart there might be a need for modifications to LMC and routing which will invalidate the LFT anyways. To enable "cache invalidation criteria" the persistent database should include information that could be used to easily check if the fabric was not altered in a way that invalidates the MinHop tables. The stored information should hold for each switch in the fabric (by guid) the list of ports and the guids and port numbers on the remote side. To validate there are no significant changes, the discovered set of switches is checked to match the stored information. Table 1 describes the possible changes and their effect on the validity of the MinHop tables. Table 1 - Connectivity Changes effect on Routing Info Validity Change | Effect on MinHop Tables | Effect on LFT and MFT | ------------------------------------------------------------------------- New Switch found | Invalidates (might connect | Invalidates | | more HCAs and carries more | | | routing resources) | | ------------------------------------------------------------------------- Missing Switch | Invalidates (MinHops might | Invalidates | | be broken a few steps away)| | ------------------------------------------------------------------------- New cable found | Does not invalidate | Does not invalidate | ------------------------------------------------------------------------- Missing Cable | Invalidates only if there | Invalidates all LIDs | (SW to SW) | is no other cable | going through that port | | connecting the switches | | ------------------------------------------------------------------------- Missing Cable | Does no invalidate | Does not invalidate | (SW to HCA) | | | ------------------------------------------------------------------------- New HCA | Does no invalidate | Does not invalidate | ------------------------------------------------------------------------- Missing HCA | Does no invalidate | Does not invalidate | ------------------------------------------------------------------------- LID Changes | Does no invalidate | Invalidates the modified| | | LIDs | ------------------------------------------------------------------------- Special marking for "root nodes" shold cache the results of the first step for Up/Dpwn routing. These nodes should be invalidated on any missing or additional switch conditions.