[openib-general] Dump and load routes with opensm?

Mon May 8 06:27:33 PDT 2006

Hi Eitan,

On Sun, 2006-05-07 at 01:31, Eitan Zahavi wrote:
> Hi Greg,
> 
> A plan for OpenSM to support loading unicast routes already exists.

What's the timeframe for implementing this ? Has the implementation
already started ? Is it what you write below and had posted to the list
last summer ?

> To do it we need to develop a scheme where the routes file also holds
> the topology using GUIDs
> such that the discovered topology is compared to the saved one. I have
> outlined the algorithm in
> the attached writeup on OpenSM routing algorithms section 6:
> "Incremental Algorithms".

Wouldn't it be better to separate out the persistency and incremental
issues ?

Also, I sent numerous questions and comments on this writeup some time
ago.

> The right place to implement this will be to have the osm_db* enhanced
> to support this new DB domain.
> Then the osm_ucast_mgr.c will need to initialize the DB and use it while
> routing.

Also, changes versus this either need disabling or handling.

-- Hal

> Regarding dump of existing tables:
> 
> If you run opensm -V or -D 0x43 you should get the file /tmp/osm.fdbs
> with that dump.
> You can also use a more SM independent method for obtaining the tables :
> if you run ibdiagnet you should get the file /tmp/ibdiagnet.fdbs with
> similar format.
> 
> Eitan
> 
> Greg Johnson wrote:
> > On Thu, May 04, 2006 at 12:39:38PM -0400, Hal Rosenstock wrote:
> > 
> >>Hi Greg,
> >>
> >>On Thu, 2006-05-04 at 12:34, Greg Johnson wrote:
> >>
> >>>Is there currently a way to dump and load routes with opensm?  If
> not,
> >>>how would I go about writing one?
> >>
> >>Is it really routes or stable LIDs you want ?
> > 
> > 
> > I actually want routes.  I have queried them with ibtraceroute and
> > ibroute, but we need routes for the whole fabric. 
> > 
> > BTW, if you call ibtraceroute thousands of times it stops working.
> > Maybe a problem in the MAD driver?
> > 
> > 
> > 
> >>LIDs are stored in /var/cache/osm/guid2lid and restored from there
> when
> >>OpenSM is started assuming the reassign LIDs option (-r or
> >>--reassign_lids) is not used when invoking OpenSM.
> > 
> > 
> > Thanks, that's good to know.
> > 
> > Greg
> > _______________________________________________
> > openib-general mailing list
> > openib-general at openib.org
> > http://openib.org/mailman/listinfo/openib-general
> > 
> > To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
> 
> 
> 
> 
> ______________________________________________________________________
> 
>               OpenSM Unicast Routing Enhancements for Scalability
>              =====================================================
> 
> Authors:Eitan Zahavi <eitan at mellanox.co.il>, Yael Kalka <yael at mellanox.co.il>
> Date:   Aug 2005.
> 
> Table of contents:
> 1. Overview
> 2. Notation
> 3. Current Algorithms
> 4. Proposed Routing Algorithms
> 5. Min Hop Tables Implementation
> 6. Incremental Routing
> 7. Routing Persistancy
> 
> 1. Overview:
> ------------
> 
> OpenSM currently uses a two stage routing algorithm for unicast
> forwarding tables calculation. As shown later these algorithm are
> O(N^3). Inspected run time of OpenSM routing stage was ~2.5min on 1100
> nodes cluster. The purpose of this memo is to present the community
> the proposed work for enhancing OpenSM routing engine.
> 
> 2. Notation:
> ------------
> The following notations are used throughout this document:
> S = Number of switch devices in the system
> P = Number of ports each switch node has
> H = Number of HCA ports connected to the fabric
> L = Number of HCAs connected to each Leaf switch device. 
>     Normal values are 1/2P to 3/4P
> D = Fat Tree depth
> 
> 3. Current Algorithms:
> ----------------------
> OpenSM provide two routing algorithms: Minimal Hop and Up/Down. Both
> of them do not scale with cluster size and can consume both large run
> time (minutes) and memory (GB). This section provides meta code for
> these algorithms and order calculation.
> 
> 3.1 Min Hop algorithm analysis:
> The Min Hop algorithm is divided into two stages: computation of
> min-hop tables on every switch and LFT output port assignment. 
> 
> Step 1: Computation of Min-Hop Tables on each switch
> 
> The memory consumed is S*(S+H)*(P+2)*Byte. On 10K nodes cluster with
> 2500 switch devices this ends up as 812M-Byte (using LMC=0).
> 
> Meta algorithm:
> For each HCA mark its remote neighbor switch port with hop 1.
> For each switch mark itself port 0 as hop 0
> While changes 
>   For each switch  
>      For each port 
>        For each LID 
>          Propagate remote port hop as hop +1 if smaller or undefined
> 
> The order of this step: O(S*P*(S+H)*(D+1))
> 
> Step 2: Assigning output port:
> For each switch 
>   For each LID 
>      For each Port 
>         Is it the one with min count
> 
> The order of this step: O(S*(S+H)*P)
> 
> 3.2 Up/Down algorithm analysis
> The Up/Down algorithm depends on the ability to rank the fabric nodes
> from root to leaf of the tree. To get that ranking it runs a
> heuristics that is based on the Min Hop tables. So the memory and 
> complexity are identical to the Min-Hop first step to start with.
>  
> Once ranking is performed the algorithm is BFS from every HCA and fill
> in the Min Hop tables again. Up/Down traversal rule is enforced during
> the BFS such that only valid turns are allowed.
> 
> Meta algorithm:
> For each HCA 
>   Get connected Switch 
>      For each Switch in NextSwitches 
>         For each Port 
>            Check if direction is OK. Check if not visited
> 
> The order is O(H*S*P)
> 
> To finalize output port assignment the second step of the Min Hop
> algorithm is invoked. 
> 
> 
> 4. Proposed Algorithm:
> ----------------------
> 
> Inspecting the routing problem we have noticed the following
> attributes:
> a. Using Min-Hop tables for keeping  intermediate routing information
>    has a disadvantage in terms of memory consumption. However, any
>    incremental routing algorithm (for handling fabric changes after
>    first setup), or routing persistence solution could use this
>    information and gain speed.
> b. Since we need to fill in LFT tables that are of the order S*(S+H)
>    the algorithm is lower bounded by O(H^2).
> c. A persistence based solution which uses previously routed fabric
>    data and is able to handle simple incremental changes will provide
>    a much faster runtime as topology match will require O(S*P)
>    (traversing all links once)
> d. Since the minimum hops information is identical for a switch and
>    all the HCAs connected to it - there is no point in building "min
>    hop" tables for HCAs. During the "output port" assignment stage,
>    the HCAs connected to each switch are considered and routed.
> 
> The result of "a" is that several algorithms that are superior from
> memory footprint and skip any "hop table" stage are not considered for
> implementation. 
> 
> To support "d" we needed to provide an index to each switch such that
> the "min hop" tables are dense (previously they were indexed by LID).
> The new index is stored on the switch object and thus allow lookup of
> a switch "min hop" by its index. An array of switch pointer by index
> supports the reverse lookup.
> 
> The suggested algorithm is broken into the following 3 stages:
> * Root nodes identification heuristics
> * Min Hop tables computation
> * Output port assignment
> 
> 4.1 Root nodes identification heuristics:
> This step is only required under the AND of the following two
> conditions: 
> * Up/Down routing is required
> * The user does not provide a file with guids of the tree "root nodes" 
> 
> This heuristics for recognizing the tree roots is based on histograms
> of the HCAs distance from every switch.
> I.e. How many HCAs are 1 hop, 2 hops from the switch. In order to fill
> in these histograms on all switches we need to BFS from every leaf
> switch and propagate the number of HCAs connected to it:
> 
> Meta algorithm:
> For each switch 
>   For each Port 
>     If connected to HCAs count them 
>     If any HCA 
>      Init BFS to start with current switch 
>      set hop count to 0 
>      While there are switches in BFS list 
>        increment hop count 
>        For each switch in BFS List 
>          Add the number of HCAs to the histogram at the current hop count 
>         For each port 
>          If remote port switch not visited 
>           Add the switch to the BFS Next Step List 
>       Once finished all this step list use next steps
> 
> The order of this step is: O(S*P + H/L*S*P) = O(*H*S)
> 
> 4.2 Min Hop tables computation:
> This step is mandatory and has a slightly different flavor for the
> case of Up/Down routing.
> 
> The algorithm starts from every leaf switch and traverses BFS wise
> through the fabric. 
> 
> Meta algorithm:
> foreach switch in the fabric
>  clear the "Rank" vector for all switches.
>  start BFS with the given switch.
>  set rank to 0
>  while any switch in BFS list
>  | foreach switch in BFS switch list
>  | |foreach port (valid, active, not unhealthy)
>  | | if remote side is a switch: 
>  | |  if rank of remote side 0 or = rank + 1
>  | |    set the remote port entry MinHopPort for this switch
>  | |  if rank of remote side 0 i.e. never visited
>  | |    set the remote switch rank to rank + 1
>  | |    add the remote switch to next BFS switches
>  | |------
>  | switch between the current and next switches list
>  | increment rank
>  |------
> 
> The order of this algorithm: O(P*S^2)
> 
> Algorithm that merges Up/Down step criteria not yet written for this
> stage. But the idea is to make each step keep track of any previous
> step down. Such that a step up will be prohibited in this case.
> 
> 4.3 Output port assignment:
> This step provide actual LFT values assignment to each switch.
> To do that we access the built "min hop" tables and track port usage.
> 
> Meta algorithm:
> foreach switch in the fabric
>  clear the port subscription vector (track number of paths subscribed)
>  foreach target switch in the "min hop" table
>   get the list of min-hop ports
>   foreach end-node attached (HCA connected to it and itself)
>    if lmc > 0 init tracking of remote system and node for selecting
>       disjoint paths for same end-node different LID LSBs
>    get min-subscribed (and disjoint) port marked min-hop target switch
>    track port usage in port-subscription (opt. if target LID is not a switch)
> 
> Order of this step: 
> Currently the selection of the output port by min-subscription is
> trivial and requires O(P) so the overall order is 
> O(S*S*(1+L)*P) <= O(16*P*S^2)
> 
> One could obtain the list of marked min-hop ports and then use a
> modified cyclic list for avoiding the search for min subscription in
> the case of LMC > 0. In that case the order could be reduced to:
> O(S*S*(1+L)) ~= O(S*H)
> 
> 5. Min Hop Table Implementation:
> --------------------------------
> The proposed algorithm does not require storing the number of hops
> arriving at the switch port - but only the fact a port is on the min
> hop path. This allowed for another memory usage reduction if the min
> hop table would be of boolean values. 
> 
> The issue then is in an efficiant iterator on the boolean (bits)
> array. The tradeoff is thus the common memory versus runtime.
> 
> (Anybody knowns off a fast boolean array lookup implementation ?)
> 
> 6. Incremental Routing:
> -----------------------
> Once the fabric is routed we can define an algorithm for performing
> incremental routing changes. An obvious case is when a link is
> declared un-healthy or one of the ports is dropped. Assuming the
> recognition of the change is done by some other algorithm. The
> following cases apply: 
> *	If the link connects HCA and a switch the HCA is unreachable. No
>    routing change required. 
> *	If the link is between switches:
>    * If there at least one another link between these switches:  
>      o Spread all routes going through the failing port to the other
>        ports connecting to the same switch.
>    * If there is no other link to these switches
>      o Go back to all switches that feed into each one of the switches
>        (feed in means they route some target lids through the switch)
>        but only those that route lids that go through the failing port. Check
>        to see if there is another port that goes to a different switch
>        to route that lid to. If there is no other way do nothing.
> 
> How do we support topology changes line moving an HCA from one Switch
> to another? 
> 
> 7. Routing Persistancy:
> -----------------------
> To make the subnet initialization faster, one could store the existing
> routing solution and use it without any calculation. 
> 
> The issue is of course what conditions makes the stored routing
> obsolete. To maximize the usefullnes of the stored information we
> propose to store the Min Hop tables rather then the final port
> assignment. It is assumed that after restart there might be a need for
> modifications to LMC and routing which will invalidate the LFT
> anyways. To enable "cache invalidation criteria" the persistent
> database should include information that could be used to easily check
> if the fabric was not altered in a way that invalidates the MinHop tables. 
> 
> The stored information should hold for each switch in the fabric (by guid)
> the list of ports and the guids and port numbers on the remote side. 
> To validate there are no significant changes, the discovered set
> of switches is checked to match the stored information. Table 1
> describes the possible changes and their effect on the  validity of
> the MinHop tables. 
> 
> Table 1 - Connectivity Changes effect on Routing Info Validity
> 
> Change           | Effect on MinHop Tables	 | Effect on LFT and MFT   |
> -------------------------------------------------------------------------
> New Switch found | Invalidates (might connect | Invalidates             |
>                  | more HCAs and carries more |                         |
>                  | routing resources)         |                         |
> -------------------------------------------------------------------------
> Missing Switch   | Invalidates (MinHops might | Invalidates             |
>                  | be broken a few steps away)|                         |
> -------------------------------------------------------------------------
> New cable found  | Does not invalidate        | Does not invalidate     |
> -------------------------------------------------------------------------
> Missing Cable    | Invalidates only if there  | Invalidates all LIDs    |
> (SW to SW)       | is no other cable          | going through that port |
>                  | connecting the switches    |                         |
> -------------------------------------------------------------------------
> Missing Cable    | Does no invalidate         | Does not invalidate     |
> (SW to HCA)      |                            |                         |
> -------------------------------------------------------------------------
> New HCA          | Does no invalidate         | Does not invalidate     |
> -------------------------------------------------------------------------
> Missing HCA      | Does no invalidate         | Does not invalidate     |
> -------------------------------------------------------------------------
> LID Changes      | Does no invalidate         | Invalidates the modified|
>                  |                            | LIDs                    |
> -------------------------------------------------------------------------
> 
> Special marking for "root nodes" shold cache the results of the first
> step for Up/Dpwn routing. These nodes should be invalidated on any
> missing or additional switch conditions.
> 
>