[openib-general] Dump and load routes with opensm?

Eitan Zahavi eitan at mellanox.co.il
Mon May 8 07:00:22 PDT 2006


> >
> > A plan for OpenSM to support loading unicast routes already exists.
> 
> What's the timeframe for implementing this ? Has the implementation
> already started ? Is it what you write below and had posted to the
list
> last summer ?
[EZ] Yes: it is the same proposal. No: the work has not started.
> 
> > To do it we need to develop a scheme where the routes file also
holds
> > the topology using GUIDs
> > such that the discovered topology is compared to the saved one. I
have
> > outlined the algorithm in
> > the attached writeup on OpenSM routing algorithms section 6:
> > "Incremental Algorithms".
> 
> Wouldn't it be better to separate out the persistency and incremental
> issues ?
[EZ] I think that once you load the routes in the file you probably need
to compare the topology used for generating them with the current one.
The complexity level of "incremental" support - i.e. the ability to
gracefully handle some topology change - might be built with time. So
first implementation might be limited to compare with reference topology
and reroute on any change. But it is a bit crude in my mind.
> 
> Also, I sent numerous questions and comments on this writeup some time
> ago.
[EZ] I thought I did answer them. 
> 
> > The right place to implement this will be to have the osm_db*
enhanced
> > to support this new DB domain.
> > Then the osm_ucast_mgr.c will need to initialize the DB and use it
while
> > routing.
> 
> Also, changes versus this either need disabling or handling.
[EZ] Not following you here. If you mean we need to be able to "shut
off" this new capability - sure.
> 
> -- Hal
> 
> > Regarding dump of existing tables:
> >
> > If you run opensm -V or -D 0x43 you should get the file
/tmp/osm.fdbs
> > with that dump.
> > You can also use a more SM independent method for obtaining the
tables :
> > if you run ibdiagnet you should get the file /tmp/ibdiagnet.fdbs
with
> > similar format.
> >
> > Eitan
> >
> > Greg Johnson wrote:
> > > On Thu, May 04, 2006 at 12:39:38PM -0400, Hal Rosenstock wrote:
> > >
> > >>Hi Greg,
> > >>
> > >>On Thu, 2006-05-04 at 12:34, Greg Johnson wrote:
> > >>
> > >>>Is there currently a way to dump and load routes with opensm?  If
> > not,
> > >>>how would I go about writing one?
> > >>
> > >>Is it really routes or stable LIDs you want ?
> > >
> > >
> > > I actually want routes.  I have queried them with ibtraceroute and
> > > ibroute, but we need routes for the whole fabric.
> > >
> > > BTW, if you call ibtraceroute thousands of times it stops working.
> > > Maybe a problem in the MAD driver?
> > >
> > >
> > >
> > >>LIDs are stored in /var/cache/osm/guid2lid and restored from there
> > when
> > >>OpenSM is started assuming the reassign LIDs option (-r or
> > >>--reassign_lids) is not used when invoking OpenSM.
> > >
> > >
> > > Thanks, that's good to know.
> > >
> > > Greg
> > > _______________________________________________
> > > openib-general mailing list
> > > openib-general at openib.org
> > > http://openib.org/mailman/listinfo/openib-general
> > >
> > > To unsubscribe, please visit
> > http://openib.org/mailman/listinfo/openib-general
> >
> >
> >
> >
> >
______________________________________________________________________
> >
> >               OpenSM Unicast Routing Enhancements for Scalability
> >              =====================================================
> >
> > Authors:Eitan Zahavi <eitan at mellanox.co.il>, Yael Kalka
<yael at mellanox.co.il>
> > Date:   Aug 2005.
> >
> > Table of contents:
> > 1. Overview
> > 2. Notation
> > 3. Current Algorithms
> > 4. Proposed Routing Algorithms
> > 5. Min Hop Tables Implementation
> > 6. Incremental Routing
> > 7. Routing Persistancy
> >
> > 1. Overview:
> > ------------
> >
> > OpenSM currently uses a two stage routing algorithm for unicast
> > forwarding tables calculation. As shown later these algorithm are
> > O(N^3). Inspected run time of OpenSM routing stage was ~2.5min on
1100
> > nodes cluster. The purpose of this memo is to present the community
> > the proposed work for enhancing OpenSM routing engine.
> >
> > 2. Notation:
> > ------------
> > The following notations are used throughout this document:
> > S = Number of switch devices in the system
> > P = Number of ports each switch node has
> > H = Number of HCA ports connected to the fabric
> > L = Number of HCAs connected to each Leaf switch device.
> >     Normal values are 1/2P to 3/4P
> > D = Fat Tree depth
> >
> > 3. Current Algorithms:
> > ----------------------
> > OpenSM provide two routing algorithms: Minimal Hop and Up/Down. Both
> > of them do not scale with cluster size and can consume both large
run
> > time (minutes) and memory (GB). This section provides meta code for
> > these algorithms and order calculation.
> >
> > 3.1 Min Hop algorithm analysis:
> > The Min Hop algorithm is divided into two stages: computation of
> > min-hop tables on every switch and LFT output port assignment.
> >
> > Step 1: Computation of Min-Hop Tables on each switch
> >
> > The memory consumed is S*(S+H)*(P+2)*Byte. On 10K nodes cluster with
> > 2500 switch devices this ends up as 812M-Byte (using LMC=0).
> >
> > Meta algorithm:
> > For each HCA mark its remote neighbor switch port with hop 1.
> > For each switch mark itself port 0 as hop 0
> > While changes
> >   For each switch
> >      For each port
> >        For each LID
> >          Propagate remote port hop as hop +1 if smaller or undefined
> >
> > The order of this step: O(S*P*(S+H)*(D+1))
> >
> > Step 2: Assigning output port:
> > For each switch
> >   For each LID
> >      For each Port
> >         Is it the one with min count
> >
> > The order of this step: O(S*(S+H)*P)
> >
> > 3.2 Up/Down algorithm analysis
> > The Up/Down algorithm depends on the ability to rank the fabric
nodes
> > from root to leaf of the tree. To get that ranking it runs a
> > heuristics that is based on the Min Hop tables. So the memory and
> > complexity are identical to the Min-Hop first step to start with.
> >
> > Once ranking is performed the algorithm is BFS from every HCA and
fill
> > in the Min Hop tables again. Up/Down traversal rule is enforced
during
> > the BFS such that only valid turns are allowed.
> >
> > Meta algorithm:
> > For each HCA
> >   Get connected Switch
> >      For each Switch in NextSwitches
> >         For each Port
> >            Check if direction is OK. Check if not visited
> >
> > The order is O(H*S*P)
> >
> > To finalize output port assignment the second step of the Min Hop
> > algorithm is invoked.
> >
> >
> > 4. Proposed Algorithm:
> > ----------------------
> >
> > Inspecting the routing problem we have noticed the following
> > attributes:
> > a. Using Min-Hop tables for keeping  intermediate routing
information
> >    has a disadvantage in terms of memory consumption. However, any
> >    incremental routing algorithm (for handling fabric changes after
> >    first setup), or routing persistence solution could use this
> >    information and gain speed.
> > b. Since we need to fill in LFT tables that are of the order S*(S+H)
> >    the algorithm is lower bounded by O(H^2).
> > c. A persistence based solution which uses previously routed fabric
> >    data and is able to handle simple incremental changes will
provide
> >    a much faster runtime as topology match will require O(S*P)
> >    (traversing all links once)
> > d. Since the minimum hops information is identical for a switch and
> >    all the HCAs connected to it - there is no point in building "min
> >    hop" tables for HCAs. During the "output port" assignment stage,
> >    the HCAs connected to each switch are considered and routed.
> >
> > The result of "a" is that several algorithms that are superior from
> > memory footprint and skip any "hop table" stage are not considered
for
> > implementation.
> >
> > To support "d" we needed to provide an index to each switch such
that
> > the "min hop" tables are dense (previously they were indexed by
LID).
> > The new index is stored on the switch object and thus allow lookup
of
> > a switch "min hop" by its index. An array of switch pointer by index
> > supports the reverse lookup.
> >
> > The suggested algorithm is broken into the following 3 stages:
> > * Root nodes identification heuristics
> > * Min Hop tables computation
> > * Output port assignment
> >
> > 4.1 Root nodes identification heuristics:
> > This step is only required under the AND of the following two
> > conditions:
> > * Up/Down routing is required
> > * The user does not provide a file with guids of the tree "root
nodes"
> >
> > This heuristics for recognizing the tree roots is based on
histograms
> > of the HCAs distance from every switch.
> > I.e. How many HCAs are 1 hop, 2 hops from the switch. In order to
fill
> > in these histograms on all switches we need to BFS from every leaf
> > switch and propagate the number of HCAs connected to it:
> >
> > Meta algorithm:
> > For each switch
> >   For each Port
> >     If connected to HCAs count them
> >     If any HCA
> >      Init BFS to start with current switch
> >      set hop count to 0
> >      While there are switches in BFS list
> >        increment hop count
> >        For each switch in BFS List
> >          Add the number of HCAs to the histogram at the current hop
count
> >         For each port
> >          If remote port switch not visited
> >           Add the switch to the BFS Next Step List
> >       Once finished all this step list use next steps
> >
> > The order of this step is: O(S*P + H/L*S*P) = O(*H*S)
> >
> > 4.2 Min Hop tables computation:
> > This step is mandatory and has a slightly different flavor for the
> > case of Up/Down routing.
> >
> > The algorithm starts from every leaf switch and traverses BFS wise
> > through the fabric.
> >
> > Meta algorithm:
> > foreach switch in the fabric
> >  clear the "Rank" vector for all switches.
> >  start BFS with the given switch.
> >  set rank to 0
> >  while any switch in BFS list
> >  | foreach switch in BFS switch list
> >  | |foreach port (valid, active, not unhealthy)
> >  | | if remote side is a switch:
> >  | |  if rank of remote side 0 or = rank + 1
> >  | |    set the remote port entry MinHopPort for this switch
> >  | |  if rank of remote side 0 i.e. never visited
> >  | |    set the remote switch rank to rank + 1
> >  | |    add the remote switch to next BFS switches
> >  | |------
> >  | switch between the current and next switches list
> >  | increment rank
> >  |------
> >
> > The order of this algorithm: O(P*S^2)
> >
> > Algorithm that merges Up/Down step criteria not yet written for this
> > stage. But the idea is to make each step keep track of any previous
> > step down. Such that a step up will be prohibited in this case.
> >
> > 4.3 Output port assignment:
> > This step provide actual LFT values assignment to each switch.
> > To do that we access the built "min hop" tables and track port
usage.
> >
> > Meta algorithm:
> > foreach switch in the fabric
> >  clear the port subscription vector (track number of paths
subscribed)
> >  foreach target switch in the "min hop" table
> >   get the list of min-hop ports
> >   foreach end-node attached (HCA connected to it and itself)
> >    if lmc > 0 init tracking of remote system and node for selecting
> >       disjoint paths for same end-node different LID LSBs
> >    get min-subscribed (and disjoint) port marked min-hop target
switch
> >    track port usage in port-subscription (opt. if target LID is not
a switch)
> >
> > Order of this step:
> > Currently the selection of the output port by min-subscription is
> > trivial and requires O(P) so the overall order is
> > O(S*S*(1+L)*P) <= O(16*P*S^2)
> >
> > One could obtain the list of marked min-hop ports and then use a
> > modified cyclic list for avoiding the search for min subscription in
> > the case of LMC > 0. In that case the order could be reduced to:
> > O(S*S*(1+L)) ~= O(S*H)
> >
> > 5. Min Hop Table Implementation:
> > --------------------------------
> > The proposed algorithm does not require storing the number of hops
> > arriving at the switch port - but only the fact a port is on the min
> > hop path. This allowed for another memory usage reduction if the min
> > hop table would be of boolean values.
> >
> > The issue then is in an efficiant iterator on the boolean (bits)
> > array. The tradeoff is thus the common memory versus runtime.
> >
> > (Anybody knowns off a fast boolean array lookup implementation ?)
> >
> > 6. Incremental Routing:
> > -----------------------
> > Once the fabric is routed we can define an algorithm for performing
> > incremental routing changes. An obvious case is when a link is
> > declared un-healthy or one of the ports is dropped. Assuming the
> > recognition of the change is done by some other algorithm. The
> > following cases apply:
> > *	If the link connects HCA and a switch the HCA is unreachable. No
> >    routing change required.
> > *	If the link is between switches:
> >    * If there at least one another link between these switches:
> >      o Spread all routes going through the failing port to the other
> >        ports connecting to the same switch.
> >    * If there is no other link to these switches
> >      o Go back to all switches that feed into each one of the
switches
> >        (feed in means they route some target lids through the
switch)
> >        but only those that route lids that go through the failing
port. Check
> >        to see if there is another port that goes to a different
switch
> >        to route that lid to. If there is no other way do nothing.
> >
> > How do we support topology changes line moving an HCA from one
Switch
> > to another?
> >
> > 7. Routing Persistancy:
> > -----------------------
> > To make the subnet initialization faster, one could store the
existing
> > routing solution and use it without any calculation.
> >
> > The issue is of course what conditions makes the stored routing
> > obsolete. To maximize the usefullnes of the stored information we
> > propose to store the Min Hop tables rather then the final port
> > assignment. It is assumed that after restart there might be a need
for
> > modifications to LMC and routing which will invalidate the LFT
> > anyways. To enable "cache invalidation criteria" the persistent
> > database should include information that could be used to easily
check
> > if the fabric was not altered in a way that invalidates the MinHop
tables.
> >
> > The stored information should hold for each switch in the fabric (by
guid)
> > the list of ports and the guids and port numbers on the remote side.
> > To validate there are no significant changes, the discovered set
> > of switches is checked to match the stored information. Table 1
> > describes the possible changes and their effect on the  validity of
> > the MinHop tables.
> >
> > Table 1 - Connectivity Changes effect on Routing Info Validity
> >
> > Change           | Effect on MinHop Tables	 | Effect on LFT and MFT
|
> >
------------------------------------------------------------------------
-
> > New Switch found | Invalidates (might connect | Invalidates
|
> >                  | more HCAs and carries more |
|
> >                  | routing resources)         |
|
> >
------------------------------------------------------------------------
-
> > Missing Switch   | Invalidates (MinHops might | Invalidates
|
> >                  | be broken a few steps away)|
|
> >
------------------------------------------------------------------------
-
> > New cable found  | Does not invalidate        | Does not invalidate
|
> >
------------------------------------------------------------------------
-
> > Missing Cable    | Invalidates only if there  | Invalidates all LIDs
|
> > (SW to SW)       | is no other cable          | going through that
port |
> >                  | connecting the switches    |
|
> >
------------------------------------------------------------------------
-
> > Missing Cable    | Does no invalidate         | Does not invalidate
|
> > (SW to HCA)      |                            |
|
> >
------------------------------------------------------------------------
-
> > New HCA          | Does no invalidate         | Does not invalidate
|
> >
------------------------------------------------------------------------
-
> > Missing HCA      | Does no invalidate         | Does not invalidate
|
> >
------------------------------------------------------------------------
-
> > LID Changes      | Does no invalidate         | Invalidates the
modified|
> >                  |                            | LIDs
|
> >
------------------------------------------------------------------------
-
> >
> > Special marking for "root nodes" shold cache the results of the
first
> > step for Up/Dpwn routing. These nodes should be invalidated on any
> > missing or additional switch conditions.
> >
> >



More information about the general mailing list