[openib-general] Dump and load routes with opensm?
Hal Rosenstock
halr at voltaire.com
Mon May 8 06:27:33 PDT 2006
Hi Eitan,
On Sun, 2006-05-07 at 01:31, Eitan Zahavi wrote:
> Hi Greg,
>
> A plan for OpenSM to support loading unicast routes already exists.
What's the timeframe for implementing this ? Has the implementation
already started ? Is it what you write below and had posted to the list
last summer ?
> To do it we need to develop a scheme where the routes file also holds
> the topology using GUIDs
> such that the discovered topology is compared to the saved one. I have
> outlined the algorithm in
> the attached writeup on OpenSM routing algorithms section 6:
> "Incremental Algorithms".
Wouldn't it be better to separate out the persistency and incremental
issues ?
Also, I sent numerous questions and comments on this writeup some time
ago.
> The right place to implement this will be to have the osm_db* enhanced
> to support this new DB domain.
> Then the osm_ucast_mgr.c will need to initialize the DB and use it while
> routing.
Also, changes versus this either need disabling or handling.
-- Hal
> Regarding dump of existing tables:
>
> If you run opensm -V or -D 0x43 you should get the file /tmp/osm.fdbs
> with that dump.
> You can also use a more SM independent method for obtaining the tables :
> if you run ibdiagnet you should get the file /tmp/ibdiagnet.fdbs with
> similar format.
>
> Eitan
>
> Greg Johnson wrote:
> > On Thu, May 04, 2006 at 12:39:38PM -0400, Hal Rosenstock wrote:
> >
> >>Hi Greg,
> >>
> >>On Thu, 2006-05-04 at 12:34, Greg Johnson wrote:
> >>
> >>>Is there currently a way to dump and load routes with opensm? If
> not,
> >>>how would I go about writing one?
> >>
> >>Is it really routes or stable LIDs you want ?
> >
> >
> > I actually want routes. I have queried them with ibtraceroute and
> > ibroute, but we need routes for the whole fabric.
> >
> > BTW, if you call ibtraceroute thousands of times it stops working.
> > Maybe a problem in the MAD driver?
> >
> >
> >
> >>LIDs are stored in /var/cache/osm/guid2lid and restored from there
> when
> >>OpenSM is started assuming the reassign LIDs option (-r or
> >>--reassign_lids) is not used when invoking OpenSM.
> >
> >
> > Thanks, that's good to know.
> >
> > Greg
> > _______________________________________________
> > openib-general mailing list
> > openib-general at openib.org
> > http://openib.org/mailman/listinfo/openib-general
> >
> > To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>
>
>
>
> ______________________________________________________________________
>
> OpenSM Unicast Routing Enhancements for Scalability
> =====================================================
>
> Authors:Eitan Zahavi <eitan at mellanox.co.il>, Yael Kalka <yael at mellanox.co.il>
> Date: Aug 2005.
>
> Table of contents:
> 1. Overview
> 2. Notation
> 3. Current Algorithms
> 4. Proposed Routing Algorithms
> 5. Min Hop Tables Implementation
> 6. Incremental Routing
> 7. Routing Persistancy
>
> 1. Overview:
> ------------
>
> OpenSM currently uses a two stage routing algorithm for unicast
> forwarding tables calculation. As shown later these algorithm are
> O(N^3). Inspected run time of OpenSM routing stage was ~2.5min on 1100
> nodes cluster. The purpose of this memo is to present the community
> the proposed work for enhancing OpenSM routing engine.
>
> 2. Notation:
> ------------
> The following notations are used throughout this document:
> S = Number of switch devices in the system
> P = Number of ports each switch node has
> H = Number of HCA ports connected to the fabric
> L = Number of HCAs connected to each Leaf switch device.
> Normal values are 1/2P to 3/4P
> D = Fat Tree depth
>
> 3. Current Algorithms:
> ----------------------
> OpenSM provide two routing algorithms: Minimal Hop and Up/Down. Both
> of them do not scale with cluster size and can consume both large run
> time (minutes) and memory (GB). This section provides meta code for
> these algorithms and order calculation.
>
> 3.1 Min Hop algorithm analysis:
> The Min Hop algorithm is divided into two stages: computation of
> min-hop tables on every switch and LFT output port assignment.
>
> Step 1: Computation of Min-Hop Tables on each switch
>
> The memory consumed is S*(S+H)*(P+2)*Byte. On 10K nodes cluster with
> 2500 switch devices this ends up as 812M-Byte (using LMC=0).
>
> Meta algorithm:
> For each HCA mark its remote neighbor switch port with hop 1.
> For each switch mark itself port 0 as hop 0
> While changes
> For each switch
> For each port
> For each LID
> Propagate remote port hop as hop +1 if smaller or undefined
>
> The order of this step: O(S*P*(S+H)*(D+1))
>
> Step 2: Assigning output port:
> For each switch
> For each LID
> For each Port
> Is it the one with min count
>
> The order of this step: O(S*(S+H)*P)
>
> 3.2 Up/Down algorithm analysis
> The Up/Down algorithm depends on the ability to rank the fabric nodes
> from root to leaf of the tree. To get that ranking it runs a
> heuristics that is based on the Min Hop tables. So the memory and
> complexity are identical to the Min-Hop first step to start with.
>
> Once ranking is performed the algorithm is BFS from every HCA and fill
> in the Min Hop tables again. Up/Down traversal rule is enforced during
> the BFS such that only valid turns are allowed.
>
> Meta algorithm:
> For each HCA
> Get connected Switch
> For each Switch in NextSwitches
> For each Port
> Check if direction is OK. Check if not visited
>
> The order is O(H*S*P)
>
> To finalize output port assignment the second step of the Min Hop
> algorithm is invoked.
>
>
> 4. Proposed Algorithm:
> ----------------------
>
> Inspecting the routing problem we have noticed the following
> attributes:
> a. Using Min-Hop tables for keeping intermediate routing information
> has a disadvantage in terms of memory consumption. However, any
> incremental routing algorithm (for handling fabric changes after
> first setup), or routing persistence solution could use this
> information and gain speed.
> b. Since we need to fill in LFT tables that are of the order S*(S+H)
> the algorithm is lower bounded by O(H^2).
> c. A persistence based solution which uses previously routed fabric
> data and is able to handle simple incremental changes will provide
> a much faster runtime as topology match will require O(S*P)
> (traversing all links once)
> d. Since the minimum hops information is identical for a switch and
> all the HCAs connected to it - there is no point in building "min
> hop" tables for HCAs. During the "output port" assignment stage,
> the HCAs connected to each switch are considered and routed.
>
> The result of "a" is that several algorithms that are superior from
> memory footprint and skip any "hop table" stage are not considered for
> implementation.
>
> To support "d" we needed to provide an index to each switch such that
> the "min hop" tables are dense (previously they were indexed by LID).
> The new index is stored on the switch object and thus allow lookup of
> a switch "min hop" by its index. An array of switch pointer by index
> supports the reverse lookup.
>
> The suggested algorithm is broken into the following 3 stages:
> * Root nodes identification heuristics
> * Min Hop tables computation
> * Output port assignment
>
> 4.1 Root nodes identification heuristics:
> This step is only required under the AND of the following two
> conditions:
> * Up/Down routing is required
> * The user does not provide a file with guids of the tree "root nodes"
>
> This heuristics for recognizing the tree roots is based on histograms
> of the HCAs distance from every switch.
> I.e. How many HCAs are 1 hop, 2 hops from the switch. In order to fill
> in these histograms on all switches we need to BFS from every leaf
> switch and propagate the number of HCAs connected to it:
>
> Meta algorithm:
> For each switch
> For each Port
> If connected to HCAs count them
> If any HCA
> Init BFS to start with current switch
> set hop count to 0
> While there are switches in BFS list
> increment hop count
> For each switch in BFS List
> Add the number of HCAs to the histogram at the current hop count
> For each port
> If remote port switch not visited
> Add the switch to the BFS Next Step List
> Once finished all this step list use next steps
>
> The order of this step is: O(S*P + H/L*S*P) = O(*H*S)
>
> 4.2 Min Hop tables computation:
> This step is mandatory and has a slightly different flavor for the
> case of Up/Down routing.
>
> The algorithm starts from every leaf switch and traverses BFS wise
> through the fabric.
>
> Meta algorithm:
> foreach switch in the fabric
> clear the "Rank" vector for all switches.
> start BFS with the given switch.
> set rank to 0
> while any switch in BFS list
> | foreach switch in BFS switch list
> | |foreach port (valid, active, not unhealthy)
> | | if remote side is a switch:
> | | if rank of remote side 0 or = rank + 1
> | | set the remote port entry MinHopPort for this switch
> | | if rank of remote side 0 i.e. never visited
> | | set the remote switch rank to rank + 1
> | | add the remote switch to next BFS switches
> | |------
> | switch between the current and next switches list
> | increment rank
> |------
>
> The order of this algorithm: O(P*S^2)
>
> Algorithm that merges Up/Down step criteria not yet written for this
> stage. But the idea is to make each step keep track of any previous
> step down. Such that a step up will be prohibited in this case.
>
> 4.3 Output port assignment:
> This step provide actual LFT values assignment to each switch.
> To do that we access the built "min hop" tables and track port usage.
>
> Meta algorithm:
> foreach switch in the fabric
> clear the port subscription vector (track number of paths subscribed)
> foreach target switch in the "min hop" table
> get the list of min-hop ports
> foreach end-node attached (HCA connected to it and itself)
> if lmc > 0 init tracking of remote system and node for selecting
> disjoint paths for same end-node different LID LSBs
> get min-subscribed (and disjoint) port marked min-hop target switch
> track port usage in port-subscription (opt. if target LID is not a switch)
>
> Order of this step:
> Currently the selection of the output port by min-subscription is
> trivial and requires O(P) so the overall order is
> O(S*S*(1+L)*P) <= O(16*P*S^2)
>
> One could obtain the list of marked min-hop ports and then use a
> modified cyclic list for avoiding the search for min subscription in
> the case of LMC > 0. In that case the order could be reduced to:
> O(S*S*(1+L)) ~= O(S*H)
>
> 5. Min Hop Table Implementation:
> --------------------------------
> The proposed algorithm does not require storing the number of hops
> arriving at the switch port - but only the fact a port is on the min
> hop path. This allowed for another memory usage reduction if the min
> hop table would be of boolean values.
>
> The issue then is in an efficiant iterator on the boolean (bits)
> array. The tradeoff is thus the common memory versus runtime.
>
> (Anybody knowns off a fast boolean array lookup implementation ?)
>
> 6. Incremental Routing:
> -----------------------
> Once the fabric is routed we can define an algorithm for performing
> incremental routing changes. An obvious case is when a link is
> declared un-healthy or one of the ports is dropped. Assuming the
> recognition of the change is done by some other algorithm. The
> following cases apply:
> * If the link connects HCA and a switch the HCA is unreachable. No
> routing change required.
> * If the link is between switches:
> * If there at least one another link between these switches:
> o Spread all routes going through the failing port to the other
> ports connecting to the same switch.
> * If there is no other link to these switches
> o Go back to all switches that feed into each one of the switches
> (feed in means they route some target lids through the switch)
> but only those that route lids that go through the failing port. Check
> to see if there is another port that goes to a different switch
> to route that lid to. If there is no other way do nothing.
>
> How do we support topology changes line moving an HCA from one Switch
> to another?
>
> 7. Routing Persistancy:
> -----------------------
> To make the subnet initialization faster, one could store the existing
> routing solution and use it without any calculation.
>
> The issue is of course what conditions makes the stored routing
> obsolete. To maximize the usefullnes of the stored information we
> propose to store the Min Hop tables rather then the final port
> assignment. It is assumed that after restart there might be a need for
> modifications to LMC and routing which will invalidate the LFT
> anyways. To enable "cache invalidation criteria" the persistent
> database should include information that could be used to easily check
> if the fabric was not altered in a way that invalidates the MinHop tables.
>
> The stored information should hold for each switch in the fabric (by guid)
> the list of ports and the guids and port numbers on the remote side.
> To validate there are no significant changes, the discovered set
> of switches is checked to match the stored information. Table 1
> describes the possible changes and their effect on the validity of
> the MinHop tables.
>
> Table 1 - Connectivity Changes effect on Routing Info Validity
>
> Change | Effect on MinHop Tables | Effect on LFT and MFT |
> -------------------------------------------------------------------------
> New Switch found | Invalidates (might connect | Invalidates |
> | more HCAs and carries more | |
> | routing resources) | |
> -------------------------------------------------------------------------
> Missing Switch | Invalidates (MinHops might | Invalidates |
> | be broken a few steps away)| |
> -------------------------------------------------------------------------
> New cable found | Does not invalidate | Does not invalidate |
> -------------------------------------------------------------------------
> Missing Cable | Invalidates only if there | Invalidates all LIDs |
> (SW to SW) | is no other cable | going through that port |
> | connecting the switches | |
> -------------------------------------------------------------------------
> Missing Cable | Does no invalidate | Does not invalidate |
> (SW to HCA) | | |
> -------------------------------------------------------------------------
> New HCA | Does no invalidate | Does not invalidate |
> -------------------------------------------------------------------------
> Missing HCA | Does no invalidate | Does not invalidate |
> -------------------------------------------------------------------------
> LID Changes | Does no invalidate | Invalidates the modified|
> | | LIDs |
> -------------------------------------------------------------------------
>
> Special marking for "root nodes" shold cache the results of the first
> step for Up/Dpwn routing. These nodes should be invalidated on any
> missing or additional switch conditions.
>
>
More information about the general
mailing list