[openib-general] Dump and load routes with opensm?

Eitan Zahavi eitan at mellanox.co.il
Mon May 8 09:14:16 PDT 2006


> > > >
> > > > A plan for OpenSM to support loading unicast routes already
exists.
> > >
> > > What's the timeframe for implementing this ? Has the
implementation
> > > already started ? Is it what you write below and had posted to the
> > list
> > > last summer ?
> > [EZ] Yes: it is the same proposal. No: the work has not started.
> > >
> > > > To do it we need to develop a scheme where the routes file also
> > holds
> > > > the topology using GUIDs
> > > > such that the discovered topology is compared to the saved one.
I
> > have
> > > > outlined the algorithm in
> > > > the attached writeup on OpenSM routing algorithms section 6:
> > > > "Incremental Algorithms".
> > >
> > > Wouldn't it be better to separate out the persistency and
incremental
> > > issues ?
> > [EZ] I think that once you load the routes in the file you probably
need
> > to compare the topology used for generating them with the current
one.
> 
> Absolutely.
> 
> > The complexity level of "incremental" support - i.e. the ability to
> > gracefully handle some topology change - might be built with time.
So
> > first implementation might be limited to compare with reference
topology
> > and reroute on any change.
> 
> The reroute on change might ultimately be the default but there might
be
> an option to disable this.
[EZ] OK I see. But this means you intentionally agree to have partial
connectivity (i.e. no route from every host to every other host).
> 
> >  But it is a bit crude in my mind.
> 
> Yes. It would allow for the experimentation I think they desire
though.
> 
> > > Also, I sent numerous questions and comments on this writeup some
time
> > > ago.
> > [EZ] I thought I did answer them.
> 
> I don't think so but I need to dig this out of my email. It's been
quite
> a while...
[EZ] I'll try my archive too
> 
> > > > The right place to implement this will be to have the osm_db*
> > enhanced
> > > > to support this new DB domain.
> > > > Then the osm_ucast_mgr.c will need to initialize the DB and use
it
> > while
> > > > routing.
> > >
> > > Also, changes versus this either need disabling or handling.
> > [EZ] Not following you here. If you mean we need to be able to "shut
> > off" this new capability - sure.
> 
> I was referring to more than this (in terms of disabling the reroute
on
> change).
[EZ] OK. 
> 
> -- Hal
> 
> > > -- Hal
> > >
> > > > Regarding dump of existing tables:
> > > >
> > > > If you run opensm -V or -D 0x43 you should get the file
> > /tmp/osm.fdbs
> > > > with that dump.
> > > > You can also use a more SM independent method for obtaining the
> > tables :
> > > > if you run ibdiagnet you should get the file /tmp/ibdiagnet.fdbs
> > with
> > > > similar format.
> > > >
> > > > Eitan
> > > >
> > > > Greg Johnson wrote:
> > > > > On Thu, May 04, 2006 at 12:39:38PM -0400, Hal Rosenstock
wrote:
> > > > >
> > > > >>Hi Greg,
> > > > >>
> > > > >>On Thu, 2006-05-04 at 12:34, Greg Johnson wrote:
> > > > >>
> > > > >>>Is there currently a way to dump and load routes with opensm?
If
> > > > not,
> > > > >>>how would I go about writing one?
> > > > >>
> > > > >>Is it really routes or stable LIDs you want ?
> > > > >
> > > > >
> > > > > I actually want routes.  I have queried them with ibtraceroute
and
> > > > > ibroute, but we need routes for the whole fabric.
> > > > >
> > > > > BTW, if you call ibtraceroute thousands of times it stops
working.
> > > > > Maybe a problem in the MAD driver?
> > > > >
> > > > >
> > > > >
> > > > >>LIDs are stored in /var/cache/osm/guid2lid and restored from
there
> > > > when
> > > > >>OpenSM is started assuming the reassign LIDs option (-r or
> > > > >>--reassign_lids) is not used when invoking OpenSM.
> > > > >
> > > > >
> > > > > Thanks, that's good to know.
> > > > >
> > > > > Greg
> > > > > _______________________________________________
> > > > > openib-general mailing list
> > > > > openib-general at openib.org
> > > > > http://openib.org/mailman/listinfo/openib-general
> > > > >
> > > > > To unsubscribe, please visit
> > > > http://openib.org/mailman/listinfo/openib-general
> > > >
> > > >
> > > >
> > > >
> > > >
> >
______________________________________________________________________
> > > >
> > > >               OpenSM Unicast Routing Enhancements for
Scalability
> > > >
=====================================================
> > > >
> > > > Authors:Eitan Zahavi <eitan at mellanox.co.il>, Yael Kalka
> > <yael at mellanox.co.il>
> > > > Date:   Aug 2005.
> > > >
> > > > Table of contents:
> > > > 1. Overview
> > > > 2. Notation
> > > > 3. Current Algorithms
> > > > 4. Proposed Routing Algorithms
> > > > 5. Min Hop Tables Implementation
> > > > 6. Incremental Routing
> > > > 7. Routing Persistancy
> > > >
> > > > 1. Overview:
> > > > ------------
> > > >
> > > > OpenSM currently uses a two stage routing algorithm for unicast
> > > > forwarding tables calculation. As shown later these algorithm
are
> > > > O(N^3). Inspected run time of OpenSM routing stage was ~2.5min
on
> > 1100
> > > > nodes cluster. The purpose of this memo is to present the
community
> > > > the proposed work for enhancing OpenSM routing engine.
> > > >
> > > > 2. Notation:
> > > > ------------
> > > > The following notations are used throughout this document:
> > > > S = Number of switch devices in the system
> > > > P = Number of ports each switch node has
> > > > H = Number of HCA ports connected to the fabric
> > > > L = Number of HCAs connected to each Leaf switch device.
> > > >     Normal values are 1/2P to 3/4P
> > > > D = Fat Tree depth
> > > >
> > > > 3. Current Algorithms:
> > > > ----------------------
> > > > OpenSM provide two routing algorithms: Minimal Hop and Up/Down.
Both
> > > > of them do not scale with cluster size and can consume both
large
> > run
> > > > time (minutes) and memory (GB). This section provides meta code
for
> > > > these algorithms and order calculation.
> > > >
> > > > 3.1 Min Hop algorithm analysis:
> > > > The Min Hop algorithm is divided into two stages: computation of
> > > > min-hop tables on every switch and LFT output port assignment.
> > > >
> > > > Step 1: Computation of Min-Hop Tables on each switch
> > > >
> > > > The memory consumed is S*(S+H)*(P+2)*Byte. On 10K nodes cluster
with
> > > > 2500 switch devices this ends up as 812M-Byte (using LMC=0).
> > > >
> > > > Meta algorithm:
> > > > For each HCA mark its remote neighbor switch port with hop 1.
> > > > For each switch mark itself port 0 as hop 0
> > > > While changes
> > > >   For each switch
> > > >      For each port
> > > >        For each LID
> > > >          Propagate remote port hop as hop +1 if smaller or
undefined
> > > >
> > > > The order of this step: O(S*P*(S+H)*(D+1))
> > > >
> > > > Step 2: Assigning output port:
> > > > For each switch
> > > >   For each LID
> > > >      For each Port
> > > >         Is it the one with min count
> > > >
> > > > The order of this step: O(S*(S+H)*P)
> > > >
> > > > 3.2 Up/Down algorithm analysis
> > > > The Up/Down algorithm depends on the ability to rank the fabric
> > nodes
> > > > from root to leaf of the tree. To get that ranking it runs a
> > > > heuristics that is based on the Min Hop tables. So the memory
and
> > > > complexity are identical to the Min-Hop first step to start
with.
> > > >
> > > > Once ranking is performed the algorithm is BFS from every HCA
and
> > fill
> > > > in the Min Hop tables again. Up/Down traversal rule is enforced
> > during
> > > > the BFS such that only valid turns are allowed.
> > > >
> > > > Meta algorithm:
> > > > For each HCA
> > > >   Get connected Switch
> > > >      For each Switch in NextSwitches
> > > >         For each Port
> > > >            Check if direction is OK. Check if not visited
> > > >
> > > > The order is O(H*S*P)
> > > >
> > > > To finalize output port assignment the second step of the Min
Hop
> > > > algorithm is invoked.
> > > >
> > > >
> > > > 4. Proposed Algorithm:
> > > > ----------------------
> > > >
> > > > Inspecting the routing problem we have noticed the following
> > > > attributes:
> > > > a. Using Min-Hop tables for keeping  intermediate routing
> > information
> > > >    has a disadvantage in terms of memory consumption. However,
any
> > > >    incremental routing algorithm (for handling fabric changes
after
> > > >    first setup), or routing persistence solution could use this
> > > >    information and gain speed.
> > > > b. Since we need to fill in LFT tables that are of the order
S*(S+H)
> > > >    the algorithm is lower bounded by O(H^2).
> > > > c. A persistence based solution which uses previously routed
fabric
> > > >    data and is able to handle simple incremental changes will
> > provide
> > > >    a much faster runtime as topology match will require O(S*P)
> > > >    (traversing all links once)
> > > > d. Since the minimum hops information is identical for a switch
and
> > > >    all the HCAs connected to it - there is no point in building
"min
> > > >    hop" tables for HCAs. During the "output port" assignment
stage,
> > > >    the HCAs connected to each switch are considered and routed.
> > > >
> > > > The result of "a" is that several algorithms that are superior
from
> > > > memory footprint and skip any "hop table" stage are not
considered
> > for
> > > > implementation.
> > > >
> > > > To support "d" we needed to provide an index to each switch such
> > that
> > > > the "min hop" tables are dense (previously they were indexed by
> > LID).
> > > > The new index is stored on the switch object and thus allow
lookup
> > of
> > > > a switch "min hop" by its index. An array of switch pointer by
index
> > > > supports the reverse lookup.
> > > >
> > > > The suggested algorithm is broken into the following 3 stages:
> > > > * Root nodes identification heuristics
> > > > * Min Hop tables computation
> > > > * Output port assignment
> > > >
> > > > 4.1 Root nodes identification heuristics:
> > > > This step is only required under the AND of the following two
> > > > conditions:
> > > > * Up/Down routing is required
> > > > * The user does not provide a file with guids of the tree "root
> > nodes"
> > > >
> > > > This heuristics for recognizing the tree roots is based on
> > histograms
> > > > of the HCAs distance from every switch.
> > > > I.e. How many HCAs are 1 hop, 2 hops from the switch. In order
to
> > fill
> > > > in these histograms on all switches we need to BFS from every
leaf
> > > > switch and propagate the number of HCAs connected to it:
> > > >
> > > > Meta algorithm:
> > > > For each switch
> > > >   For each Port
> > > >     If connected to HCAs count them
> > > >     If any HCA
> > > >      Init BFS to start with current switch
> > > >      set hop count to 0
> > > >      While there are switches in BFS list
> > > >        increment hop count
> > > >        For each switch in BFS List
> > > >          Add the number of HCAs to the histogram at the current
hop
> > count
> > > >         For each port
> > > >          If remote port switch not visited
> > > >           Add the switch to the BFS Next Step List
> > > >       Once finished all this step list use next steps
> > > >
> > > > The order of this step is: O(S*P + H/L*S*P) = O(*H*S)
> > > >
> > > > 4.2 Min Hop tables computation:
> > > > This step is mandatory and has a slightly different flavor for
the
> > > > case of Up/Down routing.
> > > >
> > > > The algorithm starts from every leaf switch and traverses BFS
wise
> > > > through the fabric.
> > > >
> > > > Meta algorithm:
> > > > foreach switch in the fabric
> > > >  clear the "Rank" vector for all switches.
> > > >  start BFS with the given switch.
> > > >  set rank to 0
> > > >  while any switch in BFS list
> > > >  | foreach switch in BFS switch list
> > > >  | |foreach port (valid, active, not unhealthy)
> > > >  | | if remote side is a switch:
> > > >  | |  if rank of remote side 0 or = rank + 1
> > > >  | |    set the remote port entry MinHopPort for this switch
> > > >  | |  if rank of remote side 0 i.e. never visited
> > > >  | |    set the remote switch rank to rank + 1
> > > >  | |    add the remote switch to next BFS switches
> > > >  | |------
> > > >  | switch between the current and next switches list
> > > >  | increment rank
> > > >  |------
> > > >
> > > > The order of this algorithm: O(P*S^2)
> > > >
> > > > Algorithm that merges Up/Down step criteria not yet written for
this
> > > > stage. But the idea is to make each step keep track of any
previous
> > > > step down. Such that a step up will be prohibited in this case.
> > > >
> > > > 4.3 Output port assignment:
> > > > This step provide actual LFT values assignment to each switch.
> > > > To do that we access the built "min hop" tables and track port
> > usage.
> > > >
> > > > Meta algorithm:
> > > > foreach switch in the fabric
> > > >  clear the port subscription vector (track number of paths
> > subscribed)
> > > >  foreach target switch in the "min hop" table
> > > >   get the list of min-hop ports
> > > >   foreach end-node attached (HCA connected to it and itself)
> > > >    if lmc > 0 init tracking of remote system and node for
selecting
> > > >       disjoint paths for same end-node different LID LSBs
> > > >    get min-subscribed (and disjoint) port marked min-hop target
> > switch
> > > >    track port usage in port-subscription (opt. if target LID is
not
> > a switch)
> > > >
> > > > Order of this step:
> > > > Currently the selection of the output port by min-subscription
is
> > > > trivial and requires O(P) so the overall order is
> > > > O(S*S*(1+L)*P) <= O(16*P*S^2)
> > > >
> > > > One could obtain the list of marked min-hop ports and then use a
> > > > modified cyclic list for avoiding the search for min
subscription in
> > > > the case of LMC > 0. In that case the order could be reduced to:
> > > > O(S*S*(1+L)) ~= O(S*H)
> > > >
> > > > 5. Min Hop Table Implementation:
> > > > --------------------------------
> > > > The proposed algorithm does not require storing the number of
hops
> > > > arriving at the switch port - but only the fact a port is on the
min
> > > > hop path. This allowed for another memory usage reduction if the
min
> > > > hop table would be of boolean values.
> > > >
> > > > The issue then is in an efficiant iterator on the boolean (bits)
> > > > array. The tradeoff is thus the common memory versus runtime.
> > > >
> > > > (Anybody knowns off a fast boolean array lookup implementation
?)
> > > >
> > > > 6. Incremental Routing:
> > > > -----------------------
> > > > Once the fabric is routed we can define an algorithm for
performing
> > > > incremental routing changes. An obvious case is when a link is
> > > > declared un-healthy or one of the ports is dropped. Assuming the
> > > > recognition of the change is done by some other algorithm. The
> > > > following cases apply:
> > > > *	If the link connects HCA and a switch the HCA is
unreachable. No
> > > >    routing change required.
> > > > *	If the link is between switches:
> > > >    * If there at least one another link between these switches:
> > > >      o Spread all routes going through the failing port to the
other
> > > >        ports connecting to the same switch.
> > > >    * If there is no other link to these switches
> > > >      o Go back to all switches that feed into each one of the
> > switches
> > > >        (feed in means they route some target lids through the
> > switch)
> > > >        but only those that route lids that go through the
failing
> > port. Check
> > > >        to see if there is another port that goes to a different
> > switch
> > > >        to route that lid to. If there is no other way do
nothing.
> > > >
> > > > How do we support topology changes line moving an HCA from one
> > Switch
> > > > to another?
> > > >
> > > > 7. Routing Persistancy:
> > > > -----------------------
> > > > To make the subnet initialization faster, one could store the
> > existing
> > > > routing solution and use it without any calculation.
> > > >
> > > > The issue is of course what conditions makes the stored routing
> > > > obsolete. To maximize the usefullnes of the stored information
we
> > > > propose to store the Min Hop tables rather then the final port
> > > > assignment. It is assumed that after restart there might be a
need
> > for
> > > > modifications to LMC and routing which will invalidate the LFT
> > > > anyways. To enable "cache invalidation criteria" the persistent
> > > > database should include information that could be used to easily
> > check
> > > > if the fabric was not altered in a way that invalidates the
MinHop
> > tables.
> > > >
> > > > The stored information should hold for each switch in the fabric
(by
> > guid)
> > > > the list of ports and the guids and port numbers on the remote
side.
> > > > To validate there are no significant changes, the discovered set
> > > > of switches is checked to match the stored information. Table 1
> > > > describes the possible changes and their effect on the  validity
of
> > > > the MinHop tables.
> > > >
> > > > Table 1 - Connectivity Changes effect on Routing Info Validity
> > > >
> > > > Change           | Effect on MinHop Tables	 | Effect on LFT
and MFT
> > |
> > > >
> >
------------------------------------------------------------------------
> > -
> > > > New Switch found | Invalidates (might connect | Invalidates
> > |
> > > >                  | more HCAs and carries more |
> > |
> > > >                  | routing resources)         |
> > |
> > > >
> >
------------------------------------------------------------------------
> > -
> > > > Missing Switch   | Invalidates (MinHops might | Invalidates
> > |
> > > >                  | be broken a few steps away)|
> > |
> > > >
> >
------------------------------------------------------------------------
> > -
> > > > New cable found  | Does not invalidate        | Does not
invalidate
> > |
> > > >
> >
------------------------------------------------------------------------
> > -
> > > > Missing Cable    | Invalidates only if there  | Invalidates all
LIDs
> > |
> > > > (SW to SW)       | is no other cable          | going through
that
> > port |
> > > >                  | connecting the switches    |
> > |
> > > >
> >
------------------------------------------------------------------------
> > -
> > > > Missing Cable    | Does no invalidate         | Does not
invalidate
> > |
> > > > (SW to HCA)      |                            |
> > |
> > > >
> >
------------------------------------------------------------------------
> > -
> > > > New HCA          | Does no invalidate         | Does not
invalidate
> > |
> > > >
> >
------------------------------------------------------------------------
> > -
> > > > Missing HCA      | Does no invalidate         | Does not
invalidate
> > |
> > > >
> >
------------------------------------------------------------------------
> > -
> > > > LID Changes      | Does no invalidate         | Invalidates the
> > modified|
> > > >                  |                            | LIDs
> > |
> > > >
> >
------------------------------------------------------------------------
> > -
> > > >
> > > > Special marking for "root nodes" shold cache the results of the
> > first
> > > > step for Up/Dpwn routing. These nodes should be invalidated on
any
> > > > missing or additional switch conditions.
> > > >
> > > >



More information about the general mailing list