[openib-general] Dump and load routes with opensm?
Hal Rosenstock
halr at voltaire.com
Mon May 8 07:08:07 PDT 2006
On Mon, 2006-05-08 at 10:00, Eitan Zahavi wrote:
> > >
> > > A plan for OpenSM to support loading unicast routes already exists.
> >
> > What's the timeframe for implementing this ? Has the implementation
> > already started ? Is it what you write below and had posted to the
> list
> > last summer ?
> [EZ] Yes: it is the same proposal. No: the work has not started.
> >
> > > To do it we need to develop a scheme where the routes file also
> holds
> > > the topology using GUIDs
> > > such that the discovered topology is compared to the saved one. I
> have
> > > outlined the algorithm in
> > > the attached writeup on OpenSM routing algorithms section 6:
> > > "Incremental Algorithms".
> >
> > Wouldn't it be better to separate out the persistency and incremental
> > issues ?
> [EZ] I think that once you load the routes in the file you probably need
> to compare the topology used for generating them with the current one.
Absolutely.
> The complexity level of "incremental" support - i.e. the ability to
> gracefully handle some topology change - might be built with time. So
> first implementation might be limited to compare with reference topology
> and reroute on any change.
The reroute on change might ultimately be the default but there might be
an option to disable this.
> But it is a bit crude in my mind.
Yes. It would allow for the experimentation I think they desire though.
> > Also, I sent numerous questions and comments on this writeup some time
> > ago.
> [EZ] I thought I did answer them.
I don't think so but I need to dig this out of my email. It's been quite
a while...
> > > The right place to implement this will be to have the osm_db*
> enhanced
> > > to support this new DB domain.
> > > Then the osm_ucast_mgr.c will need to initialize the DB and use it
> while
> > > routing.
> >
> > Also, changes versus this either need disabling or handling.
> [EZ] Not following you here. If you mean we need to be able to "shut
> off" this new capability - sure.
I was referring to more than this (in terms of disabling the reroute on
change).
-- Hal
> > -- Hal
> >
> > > Regarding dump of existing tables:
> > >
> > > If you run opensm -V or -D 0x43 you should get the file
> /tmp/osm.fdbs
> > > with that dump.
> > > You can also use a more SM independent method for obtaining the
> tables :
> > > if you run ibdiagnet you should get the file /tmp/ibdiagnet.fdbs
> with
> > > similar format.
> > >
> > > Eitan
> > >
> > > Greg Johnson wrote:
> > > > On Thu, May 04, 2006 at 12:39:38PM -0400, Hal Rosenstock wrote:
> > > >
> > > >>Hi Greg,
> > > >>
> > > >>On Thu, 2006-05-04 at 12:34, Greg Johnson wrote:
> > > >>
> > > >>>Is there currently a way to dump and load routes with opensm? If
> > > not,
> > > >>>how would I go about writing one?
> > > >>
> > > >>Is it really routes or stable LIDs you want ?
> > > >
> > > >
> > > > I actually want routes. I have queried them with ibtraceroute and
> > > > ibroute, but we need routes for the whole fabric.
> > > >
> > > > BTW, if you call ibtraceroute thousands of times it stops working.
> > > > Maybe a problem in the MAD driver?
> > > >
> > > >
> > > >
> > > >>LIDs are stored in /var/cache/osm/guid2lid and restored from there
> > > when
> > > >>OpenSM is started assuming the reassign LIDs option (-r or
> > > >>--reassign_lids) is not used when invoking OpenSM.
> > > >
> > > >
> > > > Thanks, that's good to know.
> > > >
> > > > Greg
> > > > _______________________________________________
> > > > openib-general mailing list
> > > > openib-general at openib.org
> > > > http://openib.org/mailman/listinfo/openib-general
> > > >
> > > > To unsubscribe, please visit
> > > http://openib.org/mailman/listinfo/openib-general
> > >
> > >
> > >
> > >
> > >
> ______________________________________________________________________
> > >
> > > OpenSM Unicast Routing Enhancements for Scalability
> > > =====================================================
> > >
> > > Authors:Eitan Zahavi <eitan at mellanox.co.il>, Yael Kalka
> <yael at mellanox.co.il>
> > > Date: Aug 2005.
> > >
> > > Table of contents:
> > > 1. Overview
> > > 2. Notation
> > > 3. Current Algorithms
> > > 4. Proposed Routing Algorithms
> > > 5. Min Hop Tables Implementation
> > > 6. Incremental Routing
> > > 7. Routing Persistancy
> > >
> > > 1. Overview:
> > > ------------
> > >
> > > OpenSM currently uses a two stage routing algorithm for unicast
> > > forwarding tables calculation. As shown later these algorithm are
> > > O(N^3). Inspected run time of OpenSM routing stage was ~2.5min on
> 1100
> > > nodes cluster. The purpose of this memo is to present the community
> > > the proposed work for enhancing OpenSM routing engine.
> > >
> > > 2. Notation:
> > > ------------
> > > The following notations are used throughout this document:
> > > S = Number of switch devices in the system
> > > P = Number of ports each switch node has
> > > H = Number of HCA ports connected to the fabric
> > > L = Number of HCAs connected to each Leaf switch device.
> > > Normal values are 1/2P to 3/4P
> > > D = Fat Tree depth
> > >
> > > 3. Current Algorithms:
> > > ----------------------
> > > OpenSM provide two routing algorithms: Minimal Hop and Up/Down. Both
> > > of them do not scale with cluster size and can consume both large
> run
> > > time (minutes) and memory (GB). This section provides meta code for
> > > these algorithms and order calculation.
> > >
> > > 3.1 Min Hop algorithm analysis:
> > > The Min Hop algorithm is divided into two stages: computation of
> > > min-hop tables on every switch and LFT output port assignment.
> > >
> > > Step 1: Computation of Min-Hop Tables on each switch
> > >
> > > The memory consumed is S*(S+H)*(P+2)*Byte. On 10K nodes cluster with
> > > 2500 switch devices this ends up as 812M-Byte (using LMC=0).
> > >
> > > Meta algorithm:
> > > For each HCA mark its remote neighbor switch port with hop 1.
> > > For each switch mark itself port 0 as hop 0
> > > While changes
> > > For each switch
> > > For each port
> > > For each LID
> > > Propagate remote port hop as hop +1 if smaller or undefined
> > >
> > > The order of this step: O(S*P*(S+H)*(D+1))
> > >
> > > Step 2: Assigning output port:
> > > For each switch
> > > For each LID
> > > For each Port
> > > Is it the one with min count
> > >
> > > The order of this step: O(S*(S+H)*P)
> > >
> > > 3.2 Up/Down algorithm analysis
> > > The Up/Down algorithm depends on the ability to rank the fabric
> nodes
> > > from root to leaf of the tree. To get that ranking it runs a
> > > heuristics that is based on the Min Hop tables. So the memory and
> > > complexity are identical to the Min-Hop first step to start with.
> > >
> > > Once ranking is performed the algorithm is BFS from every HCA and
> fill
> > > in the Min Hop tables again. Up/Down traversal rule is enforced
> during
> > > the BFS such that only valid turns are allowed.
> > >
> > > Meta algorithm:
> > > For each HCA
> > > Get connected Switch
> > > For each Switch in NextSwitches
> > > For each Port
> > > Check if direction is OK. Check if not visited
> > >
> > > The order is O(H*S*P)
> > >
> > > To finalize output port assignment the second step of the Min Hop
> > > algorithm is invoked.
> > >
> > >
> > > 4. Proposed Algorithm:
> > > ----------------------
> > >
> > > Inspecting the routing problem we have noticed the following
> > > attributes:
> > > a. Using Min-Hop tables for keeping intermediate routing
> information
> > > has a disadvantage in terms of memory consumption. However, any
> > > incremental routing algorithm (for handling fabric changes after
> > > first setup), or routing persistence solution could use this
> > > information and gain speed.
> > > b. Since we need to fill in LFT tables that are of the order S*(S+H)
> > > the algorithm is lower bounded by O(H^2).
> > > c. A persistence based solution which uses previously routed fabric
> > > data and is able to handle simple incremental changes will
> provide
> > > a much faster runtime as topology match will require O(S*P)
> > > (traversing all links once)
> > > d. Since the minimum hops information is identical for a switch and
> > > all the HCAs connected to it - there is no point in building "min
> > > hop" tables for HCAs. During the "output port" assignment stage,
> > > the HCAs connected to each switch are considered and routed.
> > >
> > > The result of "a" is that several algorithms that are superior from
> > > memory footprint and skip any "hop table" stage are not considered
> for
> > > implementation.
> > >
> > > To support "d" we needed to provide an index to each switch such
> that
> > > the "min hop" tables are dense (previously they were indexed by
> LID).
> > > The new index is stored on the switch object and thus allow lookup
> of
> > > a switch "min hop" by its index. An array of switch pointer by index
> > > supports the reverse lookup.
> > >
> > > The suggested algorithm is broken into the following 3 stages:
> > > * Root nodes identification heuristics
> > > * Min Hop tables computation
> > > * Output port assignment
> > >
> > > 4.1 Root nodes identification heuristics:
> > > This step is only required under the AND of the following two
> > > conditions:
> > > * Up/Down routing is required
> > > * The user does not provide a file with guids of the tree "root
> nodes"
> > >
> > > This heuristics for recognizing the tree roots is based on
> histograms
> > > of the HCAs distance from every switch.
> > > I.e. How many HCAs are 1 hop, 2 hops from the switch. In order to
> fill
> > > in these histograms on all switches we need to BFS from every leaf
> > > switch and propagate the number of HCAs connected to it:
> > >
> > > Meta algorithm:
> > > For each switch
> > > For each Port
> > > If connected to HCAs count them
> > > If any HCA
> > > Init BFS to start with current switch
> > > set hop count to 0
> > > While there are switches in BFS list
> > > increment hop count
> > > For each switch in BFS List
> > > Add the number of HCAs to the histogram at the current hop
> count
> > > For each port
> > > If remote port switch not visited
> > > Add the switch to the BFS Next Step List
> > > Once finished all this step list use next steps
> > >
> > > The order of this step is: O(S*P + H/L*S*P) = O(*H*S)
> > >
> > > 4.2 Min Hop tables computation:
> > > This step is mandatory and has a slightly different flavor for the
> > > case of Up/Down routing.
> > >
> > > The algorithm starts from every leaf switch and traverses BFS wise
> > > through the fabric.
> > >
> > > Meta algorithm:
> > > foreach switch in the fabric
> > > clear the "Rank" vector for all switches.
> > > start BFS with the given switch.
> > > set rank to 0
> > > while any switch in BFS list
> > > | foreach switch in BFS switch list
> > > | |foreach port (valid, active, not unhealthy)
> > > | | if remote side is a switch:
> > > | | if rank of remote side 0 or = rank + 1
> > > | | set the remote port entry MinHopPort for this switch
> > > | | if rank of remote side 0 i.e. never visited
> > > | | set the remote switch rank to rank + 1
> > > | | add the remote switch to next BFS switches
> > > | |------
> > > | switch between the current and next switches list
> > > | increment rank
> > > |------
> > >
> > > The order of this algorithm: O(P*S^2)
> > >
> > > Algorithm that merges Up/Down step criteria not yet written for this
> > > stage. But the idea is to make each step keep track of any previous
> > > step down. Such that a step up will be prohibited in this case.
> > >
> > > 4.3 Output port assignment:
> > > This step provide actual LFT values assignment to each switch.
> > > To do that we access the built "min hop" tables and track port
> usage.
> > >
> > > Meta algorithm:
> > > foreach switch in the fabric
> > > clear the port subscription vector (track number of paths
> subscribed)
> > > foreach target switch in the "min hop" table
> > > get the list of min-hop ports
> > > foreach end-node attached (HCA connected to it and itself)
> > > if lmc > 0 init tracking of remote system and node for selecting
> > > disjoint paths for same end-node different LID LSBs
> > > get min-subscribed (and disjoint) port marked min-hop target
> switch
> > > track port usage in port-subscription (opt. if target LID is not
> a switch)
> > >
> > > Order of this step:
> > > Currently the selection of the output port by min-subscription is
> > > trivial and requires O(P) so the overall order is
> > > O(S*S*(1+L)*P) <= O(16*P*S^2)
> > >
> > > One could obtain the list of marked min-hop ports and then use a
> > > modified cyclic list for avoiding the search for min subscription in
> > > the case of LMC > 0. In that case the order could be reduced to:
> > > O(S*S*(1+L)) ~= O(S*H)
> > >
> > > 5. Min Hop Table Implementation:
> > > --------------------------------
> > > The proposed algorithm does not require storing the number of hops
> > > arriving at the switch port - but only the fact a port is on the min
> > > hop path. This allowed for another memory usage reduction if the min
> > > hop table would be of boolean values.
> > >
> > > The issue then is in an efficiant iterator on the boolean (bits)
> > > array. The tradeoff is thus the common memory versus runtime.
> > >
> > > (Anybody knowns off a fast boolean array lookup implementation ?)
> > >
> > > 6. Incremental Routing:
> > > -----------------------
> > > Once the fabric is routed we can define an algorithm for performing
> > > incremental routing changes. An obvious case is when a link is
> > > declared un-healthy or one of the ports is dropped. Assuming the
> > > recognition of the change is done by some other algorithm. The
> > > following cases apply:
> > > * If the link connects HCA and a switch the HCA is unreachable. No
> > > routing change required.
> > > * If the link is between switches:
> > > * If there at least one another link between these switches:
> > > o Spread all routes going through the failing port to the other
> > > ports connecting to the same switch.
> > > * If there is no other link to these switches
> > > o Go back to all switches that feed into each one of the
> switches
> > > (feed in means they route some target lids through the
> switch)
> > > but only those that route lids that go through the failing
> port. Check
> > > to see if there is another port that goes to a different
> switch
> > > to route that lid to. If there is no other way do nothing.
> > >
> > > How do we support topology changes line moving an HCA from one
> Switch
> > > to another?
> > >
> > > 7. Routing Persistancy:
> > > -----------------------
> > > To make the subnet initialization faster, one could store the
> existing
> > > routing solution and use it without any calculation.
> > >
> > > The issue is of course what conditions makes the stored routing
> > > obsolete. To maximize the usefullnes of the stored information we
> > > propose to store the Min Hop tables rather then the final port
> > > assignment. It is assumed that after restart there might be a need
> for
> > > modifications to LMC and routing which will invalidate the LFT
> > > anyways. To enable "cache invalidation criteria" the persistent
> > > database should include information that could be used to easily
> check
> > > if the fabric was not altered in a way that invalidates the MinHop
> tables.
> > >
> > > The stored information should hold for each switch in the fabric (by
> guid)
> > > the list of ports and the guids and port numbers on the remote side.
> > > To validate there are no significant changes, the discovered set
> > > of switches is checked to match the stored information. Table 1
> > > describes the possible changes and their effect on the validity of
> > > the MinHop tables.
> > >
> > > Table 1 - Connectivity Changes effect on Routing Info Validity
> > >
> > > Change | Effect on MinHop Tables | Effect on LFT and MFT
> |
> > >
> ------------------------------------------------------------------------
> -
> > > New Switch found | Invalidates (might connect | Invalidates
> |
> > > | more HCAs and carries more |
> |
> > > | routing resources) |
> |
> > >
> ------------------------------------------------------------------------
> -
> > > Missing Switch | Invalidates (MinHops might | Invalidates
> |
> > > | be broken a few steps away)|
> |
> > >
> ------------------------------------------------------------------------
> -
> > > New cable found | Does not invalidate | Does not invalidate
> |
> > >
> ------------------------------------------------------------------------
> -
> > > Missing Cable | Invalidates only if there | Invalidates all LIDs
> |
> > > (SW to SW) | is no other cable | going through that
> port |
> > > | connecting the switches |
> |
> > >
> ------------------------------------------------------------------------
> -
> > > Missing Cable | Does no invalidate | Does not invalidate
> |
> > > (SW to HCA) | |
> |
> > >
> ------------------------------------------------------------------------
> -
> > > New HCA | Does no invalidate | Does not invalidate
> |
> > >
> ------------------------------------------------------------------------
> -
> > > Missing HCA | Does no invalidate | Does not invalidate
> |
> > >
> ------------------------------------------------------------------------
> -
> > > LID Changes | Does no invalidate | Invalidates the
> modified|
> > > | | LIDs
> |
> > >
> ------------------------------------------------------------------------
> -
> > >
> > > Special marking for "root nodes" shold cache the results of the
> first
> > > step for Up/Dpwn routing. These nodes should be invalidated on any
> > > missing or additional switch conditions.
> > >
> > >
More information about the general
mailing list