[ofa-general] Re: [PATCH] Update man page for configurable partition and prefix-routes file. (WAS: Re: [PATCH] opensm/man: partition cfg file location)

Hal Rosenstock hrosenstock at xsigo.com
Mon Feb 4 13:33:44 PST 2008


Ira,

On Mon, 2008-02-04 at 13:20 -0800, Ira Weiny wrote:
> This is my bad.  When I changed the config file locations and "configurability"
> I should have updated the man page.
> 
> This patch fixes this to reflect the location and names chosen at configure
> time.

It seems this patch modifies many lines in the opensm man page
beyond the ones that actually need to change. Is that needed? It makes it
hard to see exactly what changed, at least for me.
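
One thought, assuming the patch came from git format-patch (the
"diff --git" and "delete mode" lines suggest it did): rename detection
might make it readable. With -M, git can report opensm.8 -> opensm.8.in
as a rename and show only the lines that differ. A rough sketch in a
scratch repository follows; the file contents and the @CONF_DIR@ edit
are stand-ins of mine, not the real man page:

```shell
set -e
repo=$(mktemp -d)            # scratch repository; every path here is made up
cd "$repo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"
mkdir man
seq 1 200 > man/opensm.8     # stand-in for the 941-line man page
git add man/opensm.8
git commit -q -m "add man page"

# Move the file to a template name and change a single line.
sed 's/^100$/@CONF_DIR@/' man/opensm.8 > man/opensm.8.in
git rm -q man/opensm.8
git add man/opensm.8.in
git commit -q -m "generate man page from template"

# -M enables rename detection: the patch should carry "rename from" and
# "rename to" headers plus a small hunk, instead of a full delete and a
# full create of the file.
patch_text=$(git format-patch -M -1 --stdout HEAD)
printf '%s\n' "$patch_text"
```

For the real tree, something like "git format-patch -M -1 <commit>" on the
commit in question should be enough, if I am reading the situation right.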

-- Hal

> There will be a follow on patch (another email) which adds the other config
> files to the "FILES" section.  It depends on this change.
> 
> Sorry,
> Ira
> 
> On Mon, 4 Feb 2008 18:51:19 +0000
> Sasha Khapyorsky <sashak at voltaire.com> wrote:
> 
> > On 18:48 Mon 04 Feb     , Sasha Khapyorsky wrote:
> > > > --- a/opensm/man/opensm.8
> > > > +++ b/opensm/man/opensm.8
> > > > @@ -200,7 +200,7 @@ is accumulative.
> > > >  .TP
> > > >  \fB\-P\fR, \fB\-\-Pconfig\fR
> > > >  This option defines the optional partition configuration file.
> > > > -The default name is \'/etc/opensm/opensm-partitions.conf\'.
> > > > +The default name is \'/etc/ofa/opensm-partitions.conf\'.
> > > 
> > > It is also the wrong name - the partition config file name is configurable in
> > > OpenSM (look at './configure --help') and the default value is
> > > '<sysconfdir>/opensm/partitions.conf'. 'opensm -h' shows the valid value.
> > 
> > BTW in OFED-1.3 it is '/etc/opensm/partitions.conf'.
> > 
> > Sasha
> 
> <PATCH>
> 
> From cd1344594a988f2f18a903ac454ae858f42490c0 Mon Sep 17 00:00:00 2001
> From: Ira K. Weiny <weiny2 at llnl.gov>
> Date: Mon, 4 Feb 2008 13:05:16 -0800
> Subject: [PATCH] Update man page for configurable partition and prefix-routes file.
> 
>    This changes the man page to be auto-generated based on the chosen configure
>    options.
> 
> Signed-off-by: Ira K. Weiny <weiny2 at llnl.gov>
> ---
>  opensm/configure.in    |    4 +
>  opensm/man/opensm.8    |  941 ------------------------------------------------
>  opensm/man/opensm.8.in |  941 ++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 945 insertions(+), 941 deletions(-)
>  delete mode 100644 opensm/man/opensm.8
>  create mode 100644 opensm/man/opensm.8.in
> 
> diff --git a/opensm/configure.in b/opensm/configure.in
> index 79a914e..455630e 100644
> --- a/opensm/configure.in
> +++ b/opensm/configure.in
> @@ -95,6 +95,7 @@ CONF_DIR="`eval echo $CONF_DIR_TMP2`"
>  AC_DEFINE_UNQUOTED(OPENSM_CONFIG_DIR,
>  	["$CONF_DIR"],
>  	[Define OpenSM config directory])
> +AC_SUBST(CONF_DIR)
>  
>  dnl Check for a different default node name map file
>  NODENAMEMAPFILE=ib-node-name-map
> @@ -135,6 +136,7 @@ AC_MSG_RESULT(${withpartitionsconf=no})
>  AC_DEFINE_UNQUOTED(HAVE_DEFAULT_PARTITION_CONFIG_FILE,
>  	["$CONF_DIR/$PARTITION_CONFIG_FILE"],
>  	[Define a QOS policy config file])
> +AC_SUBST(PARTITION_CONFIG_FILE)
>  
>  dnl Check for a different QOS policy file
>  QOS_POLICY_FILE=qos-policy.conf
> @@ -172,5 +174,7 @@ OPENIB_APP_OSMV_CHECK_LIB
>  # overrides.
>  CFLAGS=$ac_env_CFLAGS_value
>  
> +AC_CONFIG_FILES([man/opensm.8])
> +
>  dnl Create the following Makefiles
>  AC_OUTPUT([include/opensm/osm_version.h Makefile include/Makefile complib/Makefile libvendor/Makefile opensm/Makefile osmeventplugin/Makefile osmtest/Makefile opensm.spec])
> diff --git a/opensm/man/opensm.8 b/opensm/man/opensm.8
> deleted file mode 100644
> index ab7fb8e..0000000
> --- a/opensm/man/opensm.8
> +++ /dev/null
> @@ -1,941 +0,0 @@
> -.TH OPENSM 8 "Aug 16, 2007" "OpenIB" "OpenIB Management"
> -
> -.SH NAME
> -opensm \- InfiniBand subnet manager and administration (SM/SA)
> -
> -.SH SYNOPSIS
> -.B opensm
> -[\-c(ache-options)] [\-g(uid)[=]<GUID in hex>] [\-l(mc) <LMC>]
> -[\-p(riority) <PRIORITY>] [\-smkey <SM_Key>] [\-r(eassign_lids)]
> -[\-R <engine name> | \-\-routing_engine <engine name>]
> -[\-z | \-\-connect_roots]
> -[\-M <file name> | \-\-lid_matrix_file <file name>]
> -[\-U <file name> | \-\-ucast_file <file name>]
> -[\-S | \-\-sadb_file <file name>] [\-a | \-\-root_guid_file <path to file>]
> -[\-u | \-\-cn_guid_file <path to file>] [\-o(nce)] [\-s(weep) <interval>]
> -[\-t(imeout) <milliseconds>] [\-maxsmps <number>]
> -[\-console [off | local | socket | loopback]] [\-console-port <port>]
> -[\-i(gnore-guids) <equalize-ignore-guids-file>] [\-f | \-\-log_file]
> -[\-L | \-\-log_limit <size in MB>] [\-e(rase_log_file)] [\-P(config)]
> -[\-Q | \-\-qos] [\-N | \-\-no_part_enforce] [\-y | \-\-stay_on_fatal]
> -[\-B | \-\-daemon] [\-I | \-\-inactive]
> -[\-\-perfmgr] [\-\-perfmgr_sweep_time_s <seconds>]
> -[\-\-prefix_routes_file <path>]
> -[\-v(erbose)] [\-V] [\-D <flags>] [\-d(ebug) <number>] [\-h(elp)] [\-?]
> -
> -.SH DESCRIPTION
> -.PP
> -opensm is an InfiniBand compliant Subnet Manager and Administration,
> -and runs on top of OpenIB.
> -
> -opensm provides an implementation of an InfiniBand Subnet Manager and
> -Administration. Such a software entity is required to run for in order
> -to initialize the InfiniBand hardware (at least one per each
> -InfiniBand subnet).
> -
> -opensm also now contains an experimental version of a performance
> -manager as well.
> -
> -opensm defaults were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB
> -fabric, initialize it, and sweep occasionally for changes.
> -
> -opensm attaches to a specific IB port on the local machine and configures only
> -the fabric connected to it. (If the local machine has other IB ports,
> -opensm will ignore the fabrics connected to those other ports). If no port is
> -specified, it will select the first "best" available port.
> -
> -opensm can present the available ports and prompt for a port number to
> -attach to.
> -
> -By default, the run is logged to two files: /var/log/messages and /var/log/opensm.log.
> -The first file will register only general major events, whereas the second
> -will include details of reported errors. All errors reported in this second
> -file should be treated as indicators of IB fabric health issues.
> -(Note that when a fatal and non-recoverable error occurs, opensm will exit.)
> -Both log files should include the message "SUBNET UP" if opensm was able to
> -setup the subnet correctly.
> -
> -.SH OPTIONS
> -
> -.PP
> -.TP
> -\fB\-c\fR, \fB\-\-cache-options\fR
> -Write out a list of all tunable OpenSM parameters,
> -including their current values from the command line
> -as well as defaults for others, into the file
> -OSM_CACHE_DIR/opensm.opts (OSM_CACHE_DIR defaults to
> -/var/cache/opensm if the corresponding environment
> -variable is not set). The options file is then
> -used for subsequent OpenSM invocations but any
> -command line options take precedence.
> -.TP
> -\fB\-g\fR, \fB\-\-guid\fR
> -This option specifies the local port GUID value
> -with which OpenSM should bind.  OpenSM may be
> -bound to 1 port at a time.
> -If GUID given is 0, OpenSM displays a list
> -of possible port GUIDs and waits for user input.
> -Without -g, OpenSM tries to use the default port.
> -.TP
> -\fB\-l\fR, \fB\-\-lmc\fR
> -This option specifies the subnet's LMC value.
> -The number of LIDs assigned to each port is 2^LMC.
> -The LMC value must be in the range 0-7.
> -LMC values > 0 allow multiple paths between ports.
> -LMC values > 0 should only be used if the subnet
> -topology actually provides multiple paths between
> -ports, i.e. multiple interconnects between switches.
> -Without -l, OpenSM defaults to LMC = 0, which allows
> -one path between any two ports.
> -.TP
> -\fB\-p\fR, \fB\-\-priority\fR
> -This option specifies the SM\'s PRIORITY.
> -This will effect the handover cases, where master
> -is chosen by priority and GUID.  Range goes from 0
> -(default and lowest priority) to 15 (highest).
> -.TP
> -\fB\-smkey\fR
> -This option specifies the SM\'s SM_Key (64 bits).
> -This will effect SM authentication.
> -.TP
> -\fB\-r\fR, \fB\-\-reassign_lids\fR
> -This option causes OpenSM to reassign LIDs to all
> -end nodes. Specifying -r on a running subnet
> -may disrupt subnet traffic.
> -Without -r, OpenSM attempts to preserve existing
> -LID assignments resolving multiple use of same LID.
> -.TP
> -\fB\-R\fR, \fB\-\-routing_engine\fR
> -This option chooses routing engine instead of Min Hop
> -algorithm (default).
> -Supported engines: minhop, updn, file, ftree, lash, dor
> -.TP
> -\fB\-z\fR, \fB\-\-connect_roots\fR
> -This option enforces a routing engine (currently up/down
> -only) to make connectivity between root switches and in
> -this way to be fully IBA complaint. In many cases this can
> -violate "pure" deadlock free algorithm, so use it carefully.
> -.TP
> -\fB\-M\fR, \fB\-\-lid_matrix_file\fR
> -This option specifies the name of the lid matrix dump file
> -from where switch lid matrices (min hops tables will be
> -loaded.
> -.TP
> -\fB\-U\fR, \fB\-\-ucast_file\fR
> -This option specifies the name of the unicast dump file
> -from where switch forwarding tables will be loaded.
> -.TP
> -\fB\-S\fR, \fB\-\-sadb_file\fR
> -This option specifies the name of the SA DB dump file
> -from where SA database will be loaded.
> -.TP
> -\fB\-a\fR, \fB\-\-root_guid_file\fR
> -Set the root nodes for the Up/Down or Fat-Tree routing
> -algorithm to the guids provided in the given file (one to a line).
> -.TP
> -\fB\-u\fR, \fB\-\-cn_guid_file\fR
> -Set the compute nodes for the Fat-Tree routing algorithm
> -to the guids provided in the given file (one to a line).
> -.TP
> -\fB\-o\fR, \fB\-\-once\fR
> -This option causes OpenSM to configure the subnet
> -once, then exit.  Ports remain in the ACTIVE state.
> -.TP
> -\fB\-s\fR, \fB\-\-sweep\fR
> -This option specifies the number of seconds between
> -subnet sweeps.  Specifying -s 0 disables sweeping.
> -Without -s, OpenSM defaults to a sweep interval of
> -10 seconds.
> -.TP
> -\fB\-t\fR, \fB\-\-timeout\fR
> -This option specifies the time in milliseconds
> -used for transaction timeouts.
> -Specifying -t 0 disables timeouts.
> -Without -t, OpenSM defaults to a timeout value of
> -200 milliseconds.
> -.TP
> -\fB\-maxsmps\fR
> -This option specifies the number of VL15 SMP MADs
> -allowed on the wire at any one time.
> -Specifying -maxsmps 0 allows unlimited outstanding
> -SMPs.
> -Without -maxsmps, OpenSM defaults to a maximum of
> -4 outstanding SMPs.
> -.TP
> -\fB\-console [off | local | socket | loopback]\fR
> -This option brings up the OpenSM console (default off).
> -Note that the socket and loopback options will only be available
> -if OpenSM was built with --enable-console-socket.
> -.TP
> -\fB\-console-port\fR <port>
> -Specify an alternate telnet port for the socket console (default 10000).
> -Note that this option only appears if OpenSM was built with
> ---enable-console-socket.
> -.TP
> -\fB\-i\fR, \fB\-ignore-guids\fR <equalize-ignore-guids-file>
> -This option provides the means to define a set of ports
> -(by guid) that will be ignored by the link load
> -equalization algorithm.
> -.TP
> -\fB\-x\fR, \fB\-\-honor_guid2lid\fR
> -This option forces OpenSM to honor the guid2lid file,
> -when it comes out of Standby state, if such file exists
> -under OSM_CACHE_DIR, and is valid.
> -By default, this is FALSE.
> -.TP
> -\fB\-f\fR, \fB\-\-log_file\fR
> -This option defines the log to be the given file.
> -By default, the log goes to /var/log/opensm.log.
> -For the log to go to standard output use -f stdout.
> -.TP
> -\fB\-L\fR, \fB\-\-log_limit\fR <size in MB>
> -This option defines maximal log file size in MB. When
> -specified the log file will be truncated upon reaching
> -this limit.
> -.TP
> -\fB\-e\fR, \fB\-\-erase_log_file\fR
> -This option will cause deletion of the log file
> -(if it previously exists). By default, the log file
> -is accumulative.
> -.TP
> -\fB\-P\fR, \fB\-\-Pconfig\fR
> -This option defines the optional partition configuration file.
> -The default name is \'/etc/opensm/opensm-partitions.conf\'.
> -.TP
> -.BI --prefix_routes_file= path
> -Prefix routes control how the SA responds to path record queries for
> -off-subnet DGIDs.  By default, the SA fails such queries. The
> -.B PREFIX ROUTES
> -section below describes the format of the configuration file.
> -The default path is \fB\%/etc/ofa/opensm\-prefix\-routes.conf\fP.
> -.TP
> -\fB\-Q\fR, \fB\-\-qos\fR
> -This option enables QoS setup. It is disabled by default.
> -.TP
> -\fB\-N\fR, \fB\-\-no_part_enforce\fR
> -This option disables partition enforcement on switch external ports.
> -.TP
> -\fB\-y\fR, \fB\-\-stay_on_fatal\fR
> -This option will cause SM not to exit on fatal initialization
> -issues: if SM discovers duplicated guids or a 12x link with
> -lane reversal badly configured.
> -By default, the SM will exit on these errors.
> -.TP
> -\fB\-B\fR, \fB\-\-daemon\fR
> -Run in daemon mode - OpenSM will run in the background.
> -.TP
> -\fB\-I\fR, \fB\-\-inactive\fR
> -Start SM in inactive rather than init SM state.  This
> -option can be used in conjunction with the perfmgr so as to
> -run a standalone performance manager without SM/SA.  However,
> -this is NOT currently implemented in the performance manager.
> -.TP
> -\fB\-perfmgr\fR
> -Enable the perfmgr.  Only takes effect if --enable-perfmgr was specified at
> -configure time.
> -.TP
> -\fB\-perfmgr_sweep_time_s\fR <seconds>
> -Specify the sweep time for the performance manager in seconds
> -(default is 180 seconds).  Only takes
> -effect if --enable-perfmgr was specified at configure time.
> -.TP
> -.BI --consolidate_ipv6_snm_reqests
> -Consolidate IPv6 Solicited Node Multicast group joins into 1 IB multicast
> -group.
> -.TP
> -\fB\-v\fR, \fB\-\-verbose\fR
> -This option increases the log verbosity level.
> -The -v option may be specified multiple times
> -to further increase the verbosity level.
> -See the -D option for more information about
> -log verbosity.
> -.TP
> -\fB\-V\fR
> -This option sets the maximum verbosity level and
> -forces log flushing.
> -The -V option is equivalent to \'-D 0xFF -d 2\'.
> -See the -D option for more information about
> -log verbosity.
> -.TP
> -\fB\-D\fR
> -This option sets the log verbosity level.
> -A flags field must follow the -D option.
> -A bit set/clear in the flags enables/disables a
> -specific log level as follows:
> -
> - BIT    LOG LEVEL ENABLED
> - ----   -----------------
> - 0x01 - ERROR (error messages)
> - 0x02 - INFO (basic messages, low volume)
> - 0x04 - VERBOSE (interesting stuff, moderate volume)
> - 0x08 - DEBUG (diagnostic, high volume)
> - 0x10 - FUNCS (function entry/exit, very high volume)
> - 0x20 - FRAMES (dumps all SMP and GMP frames)
> - 0x40 - ROUTING (dump FDB routing information)
> - 0x80 - currently unused.
> -
> -Without -D, OpenSM defaults to ERROR + INFO (0x3).
> -Specifying -D 0 disables all messages.
> -Specifying -D 0xFF enables all messages (see -V).
> -High verbosity levels may require increasing
> -the transaction timeout with the -t option.
> -.TP
> -\fB\-d\fR, \fB\-\-debug\fR
> -This option specifies a debug option.
> -These options are not normally needed.
> -The number following -d selects the debug
> -option to enable as follows:
> -
> - OPT   Description
> - ---    -----------------
> - -d0  - Ignore other SM nodes
> - -d1  - Force single threaded dispatching
> - -d2  - Force log flushing after each log message
> - -d3  - Disable multicast support
> -.TP
> -\fB\-h\fR, \fB\-\-help\fR
> -Display this usage info then exit.
> -.TP
> -\fB\-?\fR
> -Display this usage info then exit.
> -
> -.SH ENVIRONMENT VARIABLES
> -.PP
> -The following environment variables control opensm behavior:
> -
> -OSM_TMP_DIR - controls the directory in which the temporary files generated by
> -opensm are created. These files are: opensm-subnet.lst, opensm.fdbs, and
> -opensm.mcfdbs. By default, this directory is /var/log.
> -
> -OSM_CACHE_DIR - opensm stores certain data to the disk such that subsequent
> -runs are consistent. The default directory used is /var/cache/opensm.
> -The following files are included in it:
> -
> - guid2lid - stores the LID range assigned to each GUID
> -
> - opensm.opts - an optional file that holds a complete set of opensm
> -               configuration options
> -
> -.SH NOTES
> -.PP
> -When opensm receives a HUP signal, it starts a new heavy sweep as if a trap was received or a topology change was found.
> -.PP
> -Also, SIGUSR1 can be used to trigger a reopen of /var/log/opensm.log for
> -logrotate purposes.
> -
> -.SH PARTITION CONFIGURATION
> -.PP
> -The default name of OpenSM partitions configuration file is
> -\'/etc/ofa/opensm-partitions.conf\'. The default may be changed by using
> ---Pconfig (-P) option with OpenSM.
> -
> -The default partition will be created by OpenSM unconditionally even
> -when partition configuration file does not exist or cannot be accessed.
> -
> -The default partition has P_Key value 0x7fff. OpenSM\'s port will have
> -full membership in default partition. All other end ports will have
> -partial membership.
> -
> -File Format
> -
> -Comments:
> -
> -Line content followed after \'#\' character is comment and ignored by
> -parser.
> -
> -General file format:
> -
> -<Partition Definition>:<PortGUIDs list> ;
> -
> -Partition Definition:
> -
> -[PartitionName][=PKey][,flag[=value]][,defmember=full|limited]
> -
> - PartitionName - string, will be used with logging. When omitted
> -                 empty string will be used.
> - PKey          - P_Key value for this partition. Only low 15 bits will
> -                 be used. When omitted will be autogenerated.
> - flag          - used to indicate IPoIB capability of this partition.
> - defmember=full|limited - specifies default membership for port guid
> -                 list. Default is limited.
> -
> -Currently recognized flags are:
> -
> - ipoib       - indicates that this partition may be used for IPoIB, as
> -               result IPoIB capable MC group will be created.
> - rate=<val>  - specifies rate for this IPoIB MC group
> -               (default is 3 (10GBps))
> - mtu=<val>   - specifies MTU for this IPoIB MC group
> -               (default is 4 (2048))
> - sl=<val>    - specifies SL for this IPoIB MC group
> -               (default is 0)
> - scope=<val> - specifies scope for this IPoIB MC group
> -               (default is 2 (link local)).  Multiple scope settings
> -               are permitted for a partition.
> -
> -Note that values for rate, mtu, and scope should be specified as
> -defined in the IBTA specification (for example, mtu=4 for 2048).
> -
> -PortGUIDs list:
> -
> - PortGUID         - GUID of partition member EndPort. Hexadecimal
> -                    numbers should start from 0x, decimal numbers
> -                    are accepted too.
> - full or limited  - indicates full or limited membership for this
> -                    port.  When omitted (or unrecognized) limited
> -                    membership is assumed.
> -
> -There are two useful keywords for PortGUID definition:
> -
> - - 'ALL' means all end ports in this subnet.
> - - 'SELF' means subnet manager's port.
> -
> -Empty list means no ports in this partition.
> -
> -Notes:
> -
> -White space is permitted between delimiters ('=', ',',':',';').
> -
> -The line can be wrapped after ':' followed after Partition Definition and
> -between.
> -
> -PartitionName does not need to be unique, PKey does need to be unique.
> -If PKey is repeated then those partition configurations will be merged
> -and first PartitionName will be used (see also next note).
> -
> -It is possible to split partition configuration in more than one
> -definition, but then PKey should be explicitly specified (otherwise
> -different PKey values will be generated for those definitions).
> -
> -Examples:
> -
> - Default=0x7fff : ALL, SELF=full ;
> -
> - NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306 ;
> -
> - YetAnotherOne = 0x300 : SELF=full ;
> - YetAnotherOne = 0x300 : ALL=limited ;
> -
> - ShareIO = 0x80 , defmember=full : 0x123451, 0x123452;
> - # 0x123453, 0x123454 will be limited
> - ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full;
> - # 0x123456, 0x123457 will be limited
> - ShareIO = 0x80 : defmember=limited : 0x123456, 0x123457, 0x123458=full;
> - ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a;
> - ShareIO = 0x80 , defmember=full : 0x12345b, 0x12345c=limited, 0x12345d;
> -
> -
> -Note:
> -
> -The following rule is equivalent to how OpenSM used to run prior to the
> -partition manager:
> -
> - Default=0x7fff,ipoib:ALL=full;
> -
> -.SH QOS CONFIGURATION
> -.PP
> -There are a set of QoS related low-level configuration parameters.
> -All these parameter names are prefixed by "qos_" string. Here is a full
> -list of these parameters:
> -
> - qos_max_vls    - The maximum number of VLs that will be on the subnet
> - qos_high_limit - The limit of High Priority component of VL
> -                  Arbitration table (IBA 7.6.9)
> - qos_vlarb_low  - Low priority VL Arbitration table (IBA 7.6.9)
> -                  template
> - qos_vlarb_high - High priority VL Arbitration table (IBA 7.6.9)
> -                  template
> -                  Both VL arbitration templates are pairs of
> -                  VL and weight
> - qos_sl2vl      - SL2VL Mapping table (IBA 7.6.6) template. It is
> -                  a list of VLs corresponding to SLs 0-15 (Note
> -                  that VL15 used here means drop this SL)
> -
> -Typical default values (hard-coded in OpenSM initialization) are:
> -
> - qos_max_vls=15
> - qos_high_limit=0
> - qos_vlarb_low=0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
> - qos_vlarb_high=0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
> - qos_sl2vl=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
> -
> -The syntax is compatible with rest of OpenSM configuration options and
> -values may be stored in OpenSM config file (cached options file).
> -
> -In addition to the above, we may define separate QoS configuration
> -parameters sets for various target types. As targets, we currently support
> -CAs, routers, switch external ports, and switch's enhanced port 0. The
> -names of such specialized parameters are prefixed by "qos_<type>_"
> -string. Here is a full list of the currently supported sets:
> -
> - qos_ca_  - QoS configuration parameters set for CAs.
> - qos_rtr_ - parameters set for routers.
> - qos_sw0_ - parameters set for switches' port 0.
> - qos_swe_ - parameters set for switches' external ports.
> -
> -Examples:
> - qos_sw0_max_vls=2
> - qos_ca_sl2vl=0,1,2,3,5,5,5,12,12,0,
> - qos_swe_high_limit=0
> -
> -.SH PREFIX ROUTES
> -.PP
> -Prefix routes control how the SA responds to path record queries for
> -off-subnet DGIDs.  By default, the SA fails such queries.
> -Note that IBA does not specify how the SA should obtain off-subnet path
> -record information.
> -The prefix routes configuration is meant as a stop-gap until the
> -specification is completed.
> -.PP
> -Each line in the configuration file is a 64-bit prefix followed by a
> -64-bit GUID, separated by white space.
> -The GUID specifies the router port on the local subnet that will
> -handle the prefix.
> -Blank lines are ignored, as is anything between a \fB#\fP character
> -and the end of the line.
> -The prefix and GUID are both in hex, the leading 0x is optional.
> -Either, or both, can be wild-carded by specifying an
> -asterisk instead of an explicit prefix or GUID.
> -.PP
> -When responding to a path record query for an off-subnet DGID,
> -opensm searches for the first prefix match in the configuration file.
> -Therefore, the order of the lines in the configuration file is important:
> -a wild-carded prefix at the beginning of the configuration file renders
> -all subsequent lines useless.
> -If there is no match, then opensm fails the query.
> -It is legal to repeat prefixes in the configuration file,
> -opensm will return the path to the first available matching router.
> -A configuration file with a single line where both prefix and GUID
> -are wild-carded means that a path record query specifying any
> -off-subnet DGID should return a path to the first available router.
> -This configuration yields the same behaviour formerly achieved by
> -compiling opensm with -DROUTER_EXP.
> -
> -.SH ROUTING
> -.PP
> -OpenSM now offers five routing engines:
> -
> -1.  Min Hop Algorithm - based on the minimum hops to each node where the
> -path length is optimized.
> -
> -2.  UPDN Unicast routing algorithm - also based on the minimum hops to each
> -node, but it is constrained to ranking rules. This algorithm should be chosen
> -if the subnet is not a pure Fat Tree, and deadlock may occur due to a
> -loop in the subnet.
> -
> -3.  Fat Tree Unicast routing algorithm - this algorithm optimizes routing
> -for congestion-free "shift" communication pattern.
> -It should be chosen if a subnet is a symmetrical Fat Trees of various types,
> -not just K-ary-N-Trees: non-constant K, not fully staffed, any CBB ratio.
> -Similar to UPDN, Fat Tree routing is constrained to ranking rules.
> -
> -4. LASH unicast routing algorithm - uses Infiniband virtual layers
> -(SL) to provide deadlock-free shortest-path routing while also
> -distributing the paths between layers. LASH is an alternative
> -deadlock-free topology-agnostic routing algorithm to the non-minimal
> -UPDN algorithm avoiding the use of a potentially congested root node.
> -
> -5. DOR Unicast routing algorithm - based on the Min Hop algorithm, but
> -avoids port equalization except for redundant links between the same
> -two switches.  This provides deadlock free routes for hypercubes when
> -the fabric is cabled as a hypercube and for meshes when cabled as a
> -mesh (see details below).
> -
> -OpenSM also supports a file method which
> -can load routes from a table. See \'Modular Routing Engine\' for more
> -information on this.
> -
> -The basic routing algorithm is comprised of two stages:
> -
> -1. MinHop matrix calculation
> -   How many hops are required to get from each port to each LID ?
> -   The algorithm to fill these tables is different if you run standard
> -(min hop) or Up/Down.
> -   For standard routing, a "relaxation" algorithm is used to propagate
> -min hop from every destination LID through neighbor switches
> -   For Up/Down routing, a BFS from every target is used. The BFS tracks link
> -direction (up or down) and avoid steps that will perform up after a down
> -step was used.
> -
> -2. Once MinHop matrices exist, each switch is visited and for each target LID a
> -decision is made as to what port should be used to get to that LID.
> -   This step is common to standard and Up/Down routing. Each port has a
> -counter counting the number of target LIDs going through it.
> -   When there are multiple alternative ports with same MinHop to a LID,
> -the one with less previously assigned ports is selected.
> -   If LMC > 0, more checks are added: Within each group of LIDs assigned to
> -same target port,
> -   a. use only ports which have same MinHop
> -   b. first prefer the ones that go to different systemImageGuid (then
> -the previous LID of the same LMC group)
> -   c. if none - prefer those which go through another NodeGuid
> -   d. fall back to the number of paths method (if all go to same node).
> -
> -Effect of Topology Changes
> -
> -OpenSM will preserve existing routing in any case where there is no change in
> -the fabric switches unless the -r (--reassign_lids) option is specified.
> -
> --r
> -.br
> ---reassign_lids
> -          This option causes OpenSM to reassign LIDs to all
> -          end nodes. Specifying -r on a running subnet
> -          may disrupt subnet traffic.
> -          Without -r, OpenSM attempts to preserve existing
> -          LID assignments resolving multiple use of same LID.
> -
> -If a link is added or removed, OpenSM does not recalculate
> -the routes that do not have to change. A route has to change
> -if the port is no longer UP or no longer the MinHop. When routing changes
> -are performed, the same algorithm for balancing the routes is invoked.
> -
> -In the case of using the file based routing, any topology changes are
> -currently ignored The 'file' routing engine just loads the LFTs from the file
> -specified, with no reaction to real topology. Obviously, this will not be able
> -to recheck LIDs (by GUID) for disconnected nodes, and LFTs for non-existent
> -switches will be skipped. Multicast is not affected by 'file' routing engine
> -(this uses min hop tables).
> -
> -
> -Min Hop Algorithm
> -
> -The Min Hop algorithm is invoked when neither UPDN or the file method are
> -specified.
> -
> -The Min Hop algorithm is divided into two stages: computation of
> -min-hop tables on every switch and LFT output port assignment. Link
> -subscription is also equalized with the ability to override based on
> -port GUID. The latter is supplied by:
> -
> --i <equalize-ignore-guids-file>
> -.br
> --ignore-guids <equalize-ignore-guids-file>
> -          This option provides the means to define a set of ports
> -          (by guid) that will be ignored by the link load
> -          equalization algorithm. Note that only endports (CA,
> -          switch port 0, and router ports) and not switch external
> -          ports are supported.
> -
> -LMC awareness routes based on (remote) system or switch basis.
> -
> -
> -Purpose of UPDN Algorithm
> -
> -The UPDN algorithm is designed to prevent deadlocks from occurring in loops
> -of the subnet. A loop-deadlock is a situation in which it is no longer
> -possible to send data between any two hosts connected through the loop. As
> -such, the UPDN routing algorithm should be used if the subnet is not a pure
> -Fat Tree, and one of its loops may experience a deadlock (due, for example,
> -to high pressure).
> -
> -The UPDN algorithm is based on the following main stages:
> -
> -1.  Auto-detect root nodes - based on the CA hop length from any switch in
> -the subnet, a statistical histogram is built for each switch (hop num vs
> -number of occurrences). If the histogram reflects a specific column (higher
> -than others) for a certain node, then it is marked as a root node. Since
> -the algorithm is statistical, it may not find any root nodes. The list of
> -the root nodes found by this auto-detect stage is used by the ranking
> -process stage.
> -
> -    Note 1: The user can override the node list manually.
> -    Note 2: If this stage cannot find any root nodes, and the user did
> -            not specify a guid list file, OpenSM defaults back to the
> -            Min Hop routing algorithm.
> -
> -2.  Ranking process - All root switch nodes (found in stage 1) are assigned
> -a rank of 0. Using the BFS algorithm, the rest of the switch nodes in the
> -subnet are ranked incrementally. This ranking aids in the process of enforcing
> -rules that ensure loop-free paths.
> -
> -3.  Min Hop Table setting - after ranking is done, a BFS algorithm is run from
> -each (CA or switch) node in the subnet. During the BFS process, the FDB table
> -of each switch node traversed by BFS is updated, in reference to the starting
> -node, based on the ranking rules and guid values.
> -
> -At the end of the process, the updated FDB tables ensure loop-free paths
> -through the subnet.
> -
> -Note: Up/Down routing does not allow LID routing communication between
> -switches that are located inside spine "switch systems".
> -The reason is that there is no way to allow a LID route between them
> -that does not break the Up/Down rule.
> -One ramification of this is that you cannot run SM on switches other
> -than the leaf switches of the fabric.
> -
> -
> -UPDN Algorithm Usage
> -
> -Activation through OpenSM
> -
> -Use '-R updn' option (instead of old '-u') to activate the UPDN algorithm.
> -Use '-a <root_guid_file>' for adding an UPDN guid file that contains the
> -root nodes for ranking.
> -If the `-a' option is not used, OpenSM uses its auto-detect root nodes
> -algorithm.
> -
> -Notes on the guid list file:
> -
> -1.   A valid guid file specifies one guid in each line. Lines with an invalid
> -format will be discarded.
> -.br
> -2.   The user should specify the root switch guids. However, it is also
> -possible to specify CA guids; OpenSM will use the guid of the switch (if
> -it exists) that connects the CA to the subnet as a root node.
> -
> -
> -Fat-tree Routing Algorithm
> -
> -The fat-tree algorithm optimizes routing for "shift" communication pattern.
> -It should be chosen if a subnet is a symmetrical or almost symmetrical
> -fat-tree of various types.
> -It supports not just K-ary-N-Trees, by handling for non-constant K,
> -cases where not all leafs (CAs) are present, any CBB ratio.
> -As in UPDN, fat-tree also prevents credit-loop-deadlocks.
> -
> -If the root guid file is not provided ('-a' or '--root_guid_file' options),
> -the topology has to be pure fat-tree that complies with the following rules:
> -  - Tree rank should be between two and eight (inclusively)
> -  - Switches of the same rank should have the same number
> -    of UP-going port groups*, unless they are root switches,
> -    in which case the shouldn't have UP-going ports at all.
> -  - Switches of the same rank should have the same number
> -    of DOWN-going port groups, unless they are leaf switches.
> -  - Switches of the same rank should have the same number
> -    of ports in each UP-going port group.
> -  - Switches of the same rank should have the same number
> -    of ports in each DOWN-going port group.
> -  - All the CAs have to be at the same tree level (rank).
> -
> -If the root guid file is provided, the topology doesn't have to be pure
> -fat-tree, and it should only comply with the following rules:
> -  - Tree rank should be between two and eight (inclusively)
> -  - All the Compute Nodes** have to be at the same tree level (rank).
> -    Note that non-compute node CAs are allowed here to be at different
> -    tree ranks.
> -
> -* ports that are connected to the same remote switch are referenced as
> -\'port group\'.
> -
> -** list of compute nodes (CNs) can be specified by \'-u\' or \'--cn_guid_file\'
> -OpenSM options.
> -
> -Topologies that do not comply cause a fallback to min hop routing.
> -Note that this can also occur on link failures which cause the topology
> -to no longer be "pure" fat-tree.
> -
> -Note that although fat-tree algorithm supports trees with non-integer CBB
> -ratio, the routing will not be as balanced as in case of integer CBB ratio.
> -In addition to this, although the algorithm allows leaf switches to have any
> -number of CAs, the closer the tree is to be fully populated, the more
> -effective the "shift" communication pattern will be.
> -In general, even if the root list is provided, the closer the topology to a
> -pure and symmetrical fat-tree, the more optimal the routing will be.
> -
> -The algorithm also dumps compute node ordering file (opensm-ftree-ca-order.dump)
> -in the same directory where the OpenSM log resides. This ordering file provides
> -the CN order that may be used to create efficient communication pattern, that
> -will match the routing tables.
> -
> -Activation through OpenSM
> -
> -Use '-R ftree' option to activate the fat-tree algorithm.
> -Use '-a <root_guid_file>' to provide root nodes for ranking. If the `-a' option
> -is not used, routing algorithm will detect roots automatically.
> -Use '-u <root_cn_file>' to provide the list of compute nodes. If the `-u' option
> -is not used, all the CAs are considered as compute nodes.
> -
> -Note: LMC > 0 is not supported by fat-tree routing. If this is
> -specified, the default routing algorithm is invoked instead.
> -
> -
> -LASH Routing Algorithm
> -
> -LASH is an acronym for LAyered SHortest Path Routing. It is a
> -deterministic shortest path routing algorithm that enables topology
> -agnostic deadlock-free routing within communication networks.
> -
> -When computing the routing function, LASH analyzes the network
> -topology for the shortest-path routes between all pairs of sources /
> -destinations and groups these paths into virtual layers in such a way
> -as to avoid deadlock.
> -
> -Note LASH analyzes routes and ensures deadlock freedom between switch
> -pairs. The link from HCA between and switch does not need virtual
> -layers as deadlock will not arise between switch and HCA.
> -
> -In more detail, the algorithm works as follows:
> -
> -1) LASH determines the shortest-path between all pairs of source /
> -destination switches. Note, LASH ensures the same SL is used for all
> -SRC/DST - DST/SRC pairs and there is no guarantee that the return
> -path for a given DST/SRC will be the reverse of the route SRC/DST.
> -
> -2) LASH then begins an SL assignment process where a route is assigned
> -to a layer (SL) if the addition of that route does not cause deadlock
> -within that layer. This is achieved by maintaining and analysing a
> -channel dependency graph for each layer. Once the potential addition
> -of a path could lead to deadlock, LASH opens a new layer and continues
> -the process.
> -
> -3) Once this stage has been completed, it is highly likely that the
> -first layers processed will contain more paths than the latter ones.
> -To better balance the use of layers, LASH moves paths from one layer
> -to another so that the number of paths in each layer averages out.
> -
> -Note, the implementation of LASH in opensm attempts to use as few layers
> -as possible. This number can be less than the number of actual layers
> -available.
> -
> -In general LASH is a very flexible algorithm. It can, for example,
> -reduce to Dimension Order Routing in certain topologies, it is topology
> -agnostic and fares well in the face of faults.
> -
> -It has been shown that for both regular and irregular topologies, LASH
> -outperforms Up/Down. The reason for this is that LASH distributes the
> -traffic more evenly through a network, avoiding the bottleneck issues
> -related to a root node and always routes shortest-path.
> -
> -The algorithm was developed by Simula Research Laboratory.
> -
> -
> -Use '-R lash -Q ' option to activate the LASH algorithm.
> -
> -Note: QoS support has to be turned on in order that SL/VL mappings are
> -used.
> -
> -Note: LMC > 0 is not supported by the LASH routing. If this is
> -specified, the default routing algorithm is invoked instead.
> -
> -
> -DOR Routing Algorithm
> -
> -The Dimension Order Routing algorithm is based on the Min Hop
> -algorithm and so uses shortest paths.  Instead of spreading traffic
> -out across different paths with the same shortest distance, it chooses
> -among the available shortest paths based on an ordering of dimensions.
> -Each port must be consistently cabled to represent a hypercube
> -dimension or a mesh dimension.  Paths are grown from a destination
> -back to a source using the lowest dimension (port) of available paths
> -at each step.  This provides the ordering necessary to avoid deadlock.
> -When there are multiple links between any two switches, they still
> -represent only one dimension and traffic is balanced across them
> -unless port equalization is turned off.  In the case of hypercubes,
> -the same port must be used throughout the fabric to represent the
> -hypercube dimension and match on both ends of the cable.  In the case
> -of meshes, the dimension should consistently use the same pair of
> -ports, one port on one end of the cable, and the other port on the
> -other end, continuing along the mesh dimension.
> -
> -Use '-R dor' option to activate the DOR algorithm.
> -
> -
> -Routing References
> -
> -To learn more about deadlock-free routing, see the article
> -"Deadlock Free Message Routing in Multiprocessor Interconnection Networks"
> -by William J Dally and Charles L Seitz (1985).
> -
> -To learn more about the up/down algorithm, see the article
> -"Effective Strategy to Compute Forwarding Tables for InfiniBand Networks"
> -by Jose Carlos Sancho, Antonio Robles, and Jose Duato at the
> -Universidad Politecnica de Valencia.
> -
> -To learn more about LASH and the flexibility behind it, the requirement
> -for layers, performance comparisons to other algorithms, see the
> -following articles:
> -
> -"Layered Routing in Irregular Networks", Lysne et al, IEEE
> -Transactions on Parallel and Distributed Systems, VOL.16, No12,
> -December 2005.
> -
> -"Routing for the ASI Fabric Manager", Solheim et al. IEEE
> -Communications Magazine, Vol.44, No.7, July 2006.
> -
> -"Layered Shortest Path (LASH) Routing in Irregular System Area
> -Networks", Skeie et al. IEEE Computer Society Communication
> -Architecture for Clusters 2002.
> -
> -
> -Modular Routine Engine
> -
> -Modular routing engine structure allows for the ease of
> -"plugging" new routing modules.
> -
> -Currently, only unicast callbacks are supported. Multicast
> -can be added later.
> -
> -One existing routing module is up-down "updn", which may be
> -activated with '-R updn' option (instead of old '-u').
> -
> -General usage is:
> -$ opensm -R 'module-name'
> -
> -There is also a trivial routing module which is able
> -to load LFT tables from a dump file.
> -
> -Main features:
> -
> - - this will load switch LFTs and/or LID matrices (min hops tables)
> - - this will load switch LFTs according to the path entries introduced
> -   in the dump file
> - - no additional checks will be performed (such as "is port connected",
> -   etc.)
> - - in case when fabric LIDs were changed this will try to reconstruct
> -   LFTs correctly if endport GUIDs are represented in the dump file
> -   (in order to disable this, GUIDs may be removed from the dump file
> -    or zeroed)
> -
> -The dump file format is compatible with output of 'ibroute' util and for
> -whole fabric can be generated with dump_lfts.sh script.
> -
> -To activate file based routing module, use:
> -
> -  opensm -R file -U /path/to/dump_file
> -
> -If the dump_file is not found or is in error, the default routing
> -algorithm is utilized.
> -
> -The ability to dump switch lid matrices (aka min hops tables) to file and
> -later to load these is also supported.
> -
> -The usage is similar to unicast forwarding tables loading from dump
> -file (introduced by 'file' routing engine), but new lid matrix file
> -name should be specified by -M or --lid_matrix_file option. For example:
> -
> -  opensm -R file -M ./opensm-lid-matrix.dump
> -
> -The dump file is named \'opensm-lid-matrix.dump\' and will be generated
> -in standard opensm dump directory (/var/log by default) when
> -OSM_LOG_ROUTING logging flag is set.
> -
> -When routing engine 'file' is activated, but dump file is not specified
> -or not cannot be open default lid matrix algorithm will be used.
> -
> -There is also a switch forwarding tables dumper which generates
> -a file compatible with dump_lfts.sh output. This file can be used
> -as input for forwarding tables loading by 'file' routing engine.
> -Both or one of options -U and -M can be specified together with \'-R file\'.
> -
> -.SH FILES
> -.TP
> -.B /etc/opensm/prefix-routes.conf
> -default prefix routes file.
> -
> -.SH AUTHORS
> -.TP
> -Hal Rosenstock
> -.RI < hal at xsigo.com >
> -.TP
> -Sasha Khapyorsky
> -.RI < sashak at voltaire.com >
> -.TP
> -Eitan Zahavi
> -.RI < eitan at mellanox.co.il >
> -.TP
> -Yevgeny Kliteynik
> -.RI < kliteyn at mellanox.co.il >
> -.TP
> -Thomas Sodring
> -.RI < tsodring at simula.no >
> diff --git a/opensm/man/opensm.8.in b/opensm/man/opensm.8.in
> new file mode 100644
> index 0000000..115ab56
> --- /dev/null
> +++ b/opensm/man/opensm.8.in
> @@ -0,0 +1,941 @@
> +.TH OPENSM 8 "Aug 16, 2007" "OpenIB" "OpenIB Management"
> +
> +.SH NAME
> +opensm \- InfiniBand subnet manager and administration (SM/SA)
> +
> +.SH SYNOPSIS
> +.B opensm
> +[\-c(ache-options)] [\-g(uid)[=]<GUID in hex>] [\-l(mc) <LMC>]
> +[\-p(riority) <PRIORITY>] [\-smkey <SM_Key>] [\-r(eassign_lids)]
> +[\-R <engine name> | \-\-routing_engine <engine name>]
> +[\-z | \-\-connect_roots]
> +[\-M <file name> | \-\-lid_matrix_file <file name>]
> +[\-U <file name> | \-\-ucast_file <file name>]
> +[\-S | \-\-sadb_file <file name>] [\-a | \-\-root_guid_file <path to file>]
> +[\-u | \-\-cn_guid_file <path to file>] [\-o(nce)] [\-s(weep) <interval>]
> +[\-t(imeout) <milliseconds>] [\-maxsmps <number>]
> +[\-console [off | local | socket | loopback]] [\-console-port <port>]
> +[\-i(gnore-guids) <equalize-ignore-guids-file>] [\-f | \-\-log_file]
> +[\-L | \-\-log_limit <size in MB>] [\-e(rase_log_file)] [\-P(config)]
> +[\-Q | \-\-qos] [\-N | \-\-no_part_enforce] [\-y | \-\-stay_on_fatal]
> +[\-B | \-\-daemon] [\-I | \-\-inactive]
> +[\-\-perfmgr] [\-\-perfmgr_sweep_time_s <seconds>]
> +[\-\-prefix_routes_file <path>]
> +[\-v(erbose)] [\-V] [\-D <flags>] [\-d(ebug) <number>] [\-h(elp)] [\-?]
> +
> +.SH DESCRIPTION
> +.PP
> +opensm is an InfiniBand compliant Subnet Manager and Administration,
> +and runs on top of OpenIB.
> +
> +opensm provides an implementation of an InfiniBand Subnet Manager and
> +Administration. Such a software entity is required to run in order
> +to initialize the InfiniBand hardware (at least one per each
> +InfiniBand subnet).
> +
> +opensm now also contains an experimental version of a performance
> +manager.
> +
> +opensm defaults were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB
> +fabric, initialize it, and sweep occasionally for changes.
> +
> +opensm attaches to a specific IB port on the local machine and configures only
> +the fabric connected to it. (If the local machine has other IB ports,
> +opensm will ignore the fabrics connected to those other ports). If no port is
> +specified, it will select the first "best" available port.
> +
> +opensm can present the available ports and prompt for a port number to
> +attach to.
> +
> +By default, the run is logged to two files: /var/log/messages and /var/log/opensm.log.
> +The first file will register only general major events, whereas the second
> +will include details of reported errors. All errors reported in this second
> +file should be treated as indicators of IB fabric health issues.
> +(Note that when a fatal and non-recoverable error occurs, opensm will exit.)
> +Both log files should include the message "SUBNET UP" if opensm was able to
> +set up the subnet correctly.
> +
> +.SH OPTIONS
> +
> +.PP
> +.TP
> +\fB\-c\fR, \fB\-\-cache-options\fR
> +Write out a list of all tunable OpenSM parameters,
> +including their current values from the command line
> +as well as defaults for others, into the file
> +OSM_CACHE_DIR/opensm.opts (OSM_CACHE_DIR defaults to
> +/var/cache/opensm if the corresponding environment
> +variable is not set). The options file is then
> +used for subsequent OpenSM invocations but any
> +command line options take precedence.
> +.TP
> +\fB\-g\fR, \fB\-\-guid\fR
> +This option specifies the local port GUID value
> +with which OpenSM should bind.  OpenSM may be
> +bound to 1 port at a time.
> +If GUID given is 0, OpenSM displays a list
> +of possible port GUIDs and waits for user input.
> +Without -g, OpenSM tries to use the default port.
> +.TP
> +\fB\-l\fR, \fB\-\-lmc\fR
> +This option specifies the subnet's LMC value.
> +The number of LIDs assigned to each port is 2^LMC.
> +The LMC value must be in the range 0-7.
> +LMC values > 0 allow multiple paths between ports.
> +LMC values > 0 should only be used if the subnet
> +topology actually provides multiple paths between
> +ports, i.e. multiple interconnects between switches.
> +Without -l, OpenSM defaults to LMC = 0, which allows
> +one path between any two ports.
> +.TP
> +\fB\-p\fR, \fB\-\-priority\fR
> +This option specifies the SM\'s PRIORITY.
> +This will affect the handover cases, where master
> +is chosen by priority and GUID.  Range goes from 0
> +(default and lowest priority) to 15 (highest).
> +.TP
> +\fB\-smkey\fR
> +This option specifies the SM\'s SM_Key (64 bits).
> +This will affect SM authentication.
> +.TP
> +\fB\-r\fR, \fB\-\-reassign_lids\fR
> +This option causes OpenSM to reassign LIDs to all
> +end nodes. Specifying -r on a running subnet
> +may disrupt subnet traffic.
> +Without -r, OpenSM attempts to preserve existing
> +LID assignments resolving multiple use of same LID.
> +.TP
> +\fB\-R\fR, \fB\-\-routing_engine\fR
> +This option chooses routing engine instead of Min Hop
> +algorithm (default).
> +Supported engines: minhop, updn, file, ftree, lash, dor
> +.TP
> +\fB\-z\fR, \fB\-\-connect_roots\fR
> +This option enforces a routing engine (currently up/down
> +only) to make connectivity between root switches and in
> +this way to be fully IBA compliant. In many cases this can
> +violate "pure" deadlock free algorithm, so use it carefully.
> +.TP
> +\fB\-M\fR, \fB\-\-lid_matrix_file\fR
> +This option specifies the name of the lid matrix dump file
> +from where switch lid matrices (min hops tables) will be
> +loaded.
> +.TP
> +\fB\-U\fR, \fB\-\-ucast_file\fR
> +This option specifies the name of the unicast dump file
> +from where switch forwarding tables will be loaded.
> +.TP
> +\fB\-S\fR, \fB\-\-sadb_file\fR
> +This option specifies the name of the SA DB dump file
> +from where SA database will be loaded.
> +.TP
> +\fB\-a\fR, \fB\-\-root_guid_file\fR
> +Set the root nodes for the Up/Down or Fat-Tree routing
> +algorithm to the guids provided in the given file (one to a line).
> +.TP
> +\fB\-u\fR, \fB\-\-cn_guid_file\fR
> +Set the compute nodes for the Fat-Tree routing algorithm
> +to the guids provided in the given file (one to a line).
> +.TP
> +\fB\-o\fR, \fB\-\-once\fR
> +This option causes OpenSM to configure the subnet
> +once, then exit.  Ports remain in the ACTIVE state.
> +.TP
> +\fB\-s\fR, \fB\-\-sweep\fR
> +This option specifies the number of seconds between
> +subnet sweeps.  Specifying -s 0 disables sweeping.
> +Without -s, OpenSM defaults to a sweep interval of
> +10 seconds.
> +.TP
> +\fB\-t\fR, \fB\-\-timeout\fR
> +This option specifies the time in milliseconds
> +used for transaction timeouts.
> +Specifying -t 0 disables timeouts.
> +Without -t, OpenSM defaults to a timeout value of
> +200 milliseconds.
> +.TP
> +\fB\-maxsmps\fR
> +This option specifies the number of VL15 SMP MADs
> +allowed on the wire at any one time.
> +Specifying -maxsmps 0 allows unlimited outstanding
> +SMPs.
> +Without -maxsmps, OpenSM defaults to a maximum of
> +4 outstanding SMPs.
> +.TP
> +\fB\-console [off | local | socket | loopback]\fR
> +This option brings up the OpenSM console (default off).
> +Note that the socket and loopback options will only be available
> +if OpenSM was built with --enable-console-socket.
> +.TP
> +\fB\-console-port\fR <port>
> +Specify an alternate telnet port for the socket console (default 10000).
> +Note that this option only appears if OpenSM was built with
> +--enable-console-socket.
> +.TP
> +\fB\-i\fR, \fB\-ignore-guids\fR <equalize-ignore-guids-file>
> +This option provides the means to define a set of ports
> +(by guid) that will be ignored by the link load
> +equalization algorithm.
> +.TP
> +\fB\-x\fR, \fB\-\-honor_guid2lid\fR
> +This option forces OpenSM to honor the guid2lid file,
> +when it comes out of Standby state, if such file exists
> +under OSM_CACHE_DIR, and is valid.
> +By default, this is FALSE.
> +.TP
> +\fB\-f\fR, \fB\-\-log_file\fR
> +This option defines the log to be the given file.
> +By default, the log goes to /var/log/opensm.log.
> +For the log to go to standard output use -f stdout.
> +.TP
> +\fB\-L\fR, \fB\-\-log_limit\fR <size in MB>
> +This option defines maximal log file size in MB. When
> +specified the log file will be truncated upon reaching
> +this limit.
> +.TP
> +\fB\-e\fR, \fB\-\-erase_log_file\fR
> +This option will cause deletion of the log file
> +(if it already exists). By default, the log file
> +is cumulative.
> +.TP
> +\fB\-P\fR, \fB\-\-Pconfig\fR
> +This option defines the optional partition configuration file.
> +The default name is \fB\%@CONF_DIR@/@PARTITION_CONFIG_FILE@\fP.
> +.TP
> +.BI --prefix_routes_file= path
> +Prefix routes control how the SA responds to path record queries for
> +off-subnet DGIDs.  By default, the SA fails such queries. The
> +.B PREFIX ROUTES
> +section below describes the format of the configuration file.
> +The default path is \fB\%@CONF_DIR@/prefix\-routes.conf\fP.
> +.TP
> +\fB\-Q\fR, \fB\-\-qos\fR
> +This option enables QoS setup. It is disabled by default.
> +.TP
> +\fB\-N\fR, \fB\-\-no_part_enforce\fR
> +This option disables partition enforcement on switch external ports.
> +.TP
> +\fB\-y\fR, \fB\-\-stay_on_fatal\fR
> +This option will cause SM not to exit on fatal initialization
> +issues: if SM discovers duplicated guids or a 12x link with
> +lane reversal badly configured.
> +By default, the SM will exit on these errors.
> +.TP
> +\fB\-B\fR, \fB\-\-daemon\fR
> +Run in daemon mode - OpenSM will run in the background.
> +.TP
> +\fB\-I\fR, \fB\-\-inactive\fR
> +Start SM in inactive rather than init SM state.  This
> +option can be used in conjunction with the perfmgr so as to
> +run a standalone performance manager without SM/SA.  However,
> +this is NOT currently implemented in the performance manager.
> +.TP
> +\fB\-\-perfmgr\fR
> +Enable the perfmgr.  Only takes effect if --enable-perfmgr was specified at
> +configure time.
> +.TP
> +\fB\-\-perfmgr_sweep_time_s\fR <seconds>
> +Specify the sweep time for the performance manager in seconds
> +(default is 180 seconds).  Only takes
> +effect if --enable-perfmgr was specified at configure time.
> +.TP
> +.BI --consolidate_ipv6_snm_reqests
> +Consolidate IPv6 Solicited Node Multicast group joins into 1 IB multicast
> +group.
> +.TP
> +\fB\-v\fR, \fB\-\-verbose\fR
> +This option increases the log verbosity level.
> +The -v option may be specified multiple times
> +to further increase the verbosity level.
> +See the -D option for more information about
> +log verbosity.
> +.TP
> +\fB\-V\fR
> +This option sets the maximum verbosity level and
> +forces log flushing.
> +The -V option is equivalent to \'-D 0xFF -d 2\'.
> +See the -D option for more information about
> +log verbosity.
> +.TP
> +\fB\-D\fR
> +This option sets the log verbosity level.
> +A flags field must follow the -D option.
> +A bit set/clear in the flags enables/disables a
> +specific log level as follows:
> +
> + BIT    LOG LEVEL ENABLED
> + ----   -----------------
> + 0x01 - ERROR (error messages)
> + 0x02 - INFO (basic messages, low volume)
> + 0x04 - VERBOSE (interesting stuff, moderate volume)
> + 0x08 - DEBUG (diagnostic, high volume)
> + 0x10 - FUNCS (function entry/exit, very high volume)
> + 0x20 - FRAMES (dumps all SMP and GMP frames)
> + 0x40 - ROUTING (dump FDB routing information)
> + 0x80 - currently unused.
> +
> +Without -D, OpenSM defaults to ERROR + INFO (0x3).
> +Specifying -D 0 disables all messages.
> +Specifying -D 0xFF enables all messages (see -V).
> +High verbosity levels may require increasing
> +the transaction timeout with the -t option.
> +.TP
> +\fB\-d\fR, \fB\-\-debug\fR
> +This option specifies a debug option.
> +These options are not normally needed.
> +The number following -d selects the debug
> +option to enable as follows:
> +
> + OPT   Description
> + ---    -----------------
> + -d0  - Ignore other SM nodes
> + -d1  - Force single threaded dispatching
> + -d2  - Force log flushing after each log message
> + -d3  - Disable multicast support
> +.TP
> +\fB\-h\fR, \fB\-\-help\fR
> +Display this usage info then exit.
> +.TP
> +\fB\-?\fR
> +Display this usage info then exit.
> +
> +.SH ENVIRONMENT VARIABLES
> +.PP
> +The following environment variables control opensm behavior:
> +
> +OSM_TMP_DIR - controls the directory in which the temporary files generated by
> +opensm are created. These files are: opensm-subnet.lst, opensm.fdbs, and
> +opensm.mcfdbs. By default, this directory is /var/log.
> +
> +OSM_CACHE_DIR - opensm stores certain data to the disk such that subsequent
> +runs are consistent. The default directory used is /var/cache/opensm.
> +The following files are included in it:
> +
> + guid2lid - stores the LID range assigned to each GUID
> +
> + opensm.opts - an optional file that holds a complete set of opensm
> +               configuration options
> +
> +.SH NOTES
> +.PP
> +When opensm receives a HUP signal, it starts a new heavy sweep as if a trap was received or a topology change was found.
> +.PP
> +Also, SIGUSR1 can be used to trigger a reopen of /var/log/opensm.log for
> +logrotate purposes.
> +
> +.SH PARTITION CONFIGURATION
> +.PP
> +The default name of OpenSM partitions configuration file is
> +\fB\%@CONF_DIR@/@PARTITION_CONFIG_FILE@\fP. The default may be changed by using
> +--Pconfig (-P) option with OpenSM.
> +
> +The default partition will be created by OpenSM unconditionally even
> +when partition configuration file does not exist or cannot be accessed.
> +
> +The default partition has P_Key value 0x7fff. OpenSM\'s port will have
> +full membership in default partition. All other end ports will have
> +partial membership.
> +
> +File Format
> +
> +Comments:
> +
> +Line content following a \'#\' character is a comment and is ignored by
> +the parser.
> +
> +General file format:
> +
> +<Partition Definition>:<PortGUIDs list> ;
> +
> +Partition Definition:
> +
> +[PartitionName][=PKey][,flag[=value]][,defmember=full|limited]
> +
> + PartitionName - string, will be used with logging. When omitted
> +                 empty string will be used.
> + PKey          - P_Key value for this partition. Only low 15 bits will
> +                 be used. When omitted will be autogenerated.
> + flag          - used to indicate IPoIB capability of this partition.
> + defmember=full|limited - specifies default membership for port guid
> +                 list. Default is limited.
> +
> +Currently recognized flags are:
> +
> + ipoib       - indicates that this partition may be used for IPoIB; as a
> +               result, an IPoIB-capable MC group will be created.
> + rate=<val>  - specifies rate for this IPoIB MC group
> +               (default is 3 (10 Gbps))
> + mtu=<val>   - specifies MTU for this IPoIB MC group
> +               (default is 4 (2048))
> + sl=<val>    - specifies SL for this IPoIB MC group
> +               (default is 0)
> + scope=<val> - specifies scope for this IPoIB MC group
> +               (default is 2 (link local)).  Multiple scope settings
> +               are permitted for a partition.
> +
> +Note that values for rate, mtu, and scope should be specified as
> +defined in the IBTA specification (for example, mtu=4 for 2048).
> +
> +PortGUIDs list:
> +
> + PortGUID         - GUID of partition member EndPort. Hexadecimal
> +                    numbers should start from 0x, decimal numbers
> +                    are accepted too.
> + full or limited  - indicates full or limited membership for this
> +                    port.  When omitted (or unrecognized) limited
> +                    membership is assumed.
> +
> +There are two useful keywords for PortGUID definition:
> +
> + - 'ALL' means all end ports in this subnet.
> + - 'SELF' means subnet manager's port.
> +
> +Empty list means no ports in this partition.
> +
> +Notes:
> +
> +White space is permitted between delimiters ('=', ',',':',';').
> +
> +The line can be wrapped after the ':' that follows the Partition Definition
> +and between the entries of the PortGUIDs list.
> +
> +PartitionName does not need to be unique, but PKey does need to be unique.
> +If PKey is repeated then those partition configurations will be merged
> +and first PartitionName will be used (see also next note).
> +
> +It is possible to split partition configuration in more than one
> +definition, but then PKey should be explicitly specified (otherwise
> +different PKey values will be generated for those definitions).
> +
> +Examples:
> +
> + Default=0x7fff : ALL, SELF=full ;
> +
> + NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306 ;
> +
> + YetAnotherOne = 0x300 : SELF=full ;
> + YetAnotherOne = 0x300 : ALL=limited ;
> +
> + ShareIO = 0x80 , defmember=full : 0x123451, 0x123452;
> + # 0x123453, 0x123454 will be limited
> + ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full;
> + # 0x123456, 0x123457 will be limited
> + ShareIO = 0x80 , defmember=limited : 0x123456, 0x123457, 0x123458=full;
> + ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a;
> + ShareIO = 0x80 , defmember=full : 0x12345b, 0x12345c=limited, 0x12345d;
> +
> +
> +Note:
> +
> +The following rule is equivalent to how OpenSM used to run prior to the
> +partition manager:
> +
> + Default=0x7fff,ipoib:ALL=full;
> +
> +.SH QOS CONFIGURATION
> +.PP
> +There is a set of QoS-related low-level configuration parameters.
> +All these parameter names are prefixed by "qos_" string. Here is a full
> +list of these parameters:
> +
> + qos_max_vls    - The maximum number of VLs that will be on the subnet
> + qos_high_limit - The limit of High Priority component of VL
> +                  Arbitration table (IBA 7.6.9)
> + qos_vlarb_low  - Low priority VL Arbitration table (IBA 7.6.9)
> +                  template
> + qos_vlarb_high - High priority VL Arbitration table (IBA 7.6.9)
> +                  template
> +                  Both VL arbitration templates are pairs of
> +                  VL and weight
> + qos_sl2vl      - SL2VL Mapping table (IBA 7.6.6) template. It is
> +                  a list of VLs corresponding to SLs 0-15 (Note
> +                  that VL15 used here means drop this SL)
> +
> +Typical default values (hard-coded in OpenSM initialization) are:
> +
> + qos_max_vls=15
> + qos_high_limit=0
> + qos_vlarb_low=0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
> + qos_vlarb_high=0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
> + qos_sl2vl=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
> +
> +The syntax is compatible with rest of OpenSM configuration options and
> +values may be stored in OpenSM config file (cached options file).
> +
> +In addition to the above, we may define separate QoS configuration
> +parameters sets for various target types. As targets, we currently support
> +CAs, routers, switch external ports, and switch's enhanced port 0. The
> +names of such specialized parameters are prefixed by "qos_<type>_"
> +string. Here is a full list of the currently supported sets:
> +
> + qos_ca_  - QoS configuration parameters set for CAs.
> + qos_rtr_ - parameters set for routers.
> + qos_sw0_ - parameters set for switches' port 0.
> + qos_swe_ - parameters set for switches' external ports.
> +
> +Examples:
> + qos_sw0_max_vls=2
> + qos_ca_sl2vl=0,1,2,3,5,5,5,12,12,0,
> + qos_swe_high_limit=0
> +
> +.SH PREFIX ROUTES
> +.PP
> +Prefix routes control how the SA responds to path record queries for
> +off-subnet DGIDs.  By default, the SA fails such queries.
> +Note that IBA does not specify how the SA should obtain off-subnet path
> +record information.
> +The prefix routes configuration is meant as a stop-gap until the
> +specification is completed.
> +.PP
> +Each line in the configuration file is a 64-bit prefix followed by a
> +64-bit GUID, separated by white space.
> +The GUID specifies the router port on the local subnet that will
> +handle the prefix.
> +Blank lines are ignored, as is anything between a \fB#\fP character
> +and the end of the line.
> +The prefix and GUID are both in hex; the leading 0x is optional.
> +Either, or both, can be wild-carded by specifying an
> +asterisk instead of an explicit prefix or GUID.
> +.PP
> +When responding to a path record query for an off-subnet DGID,
> +opensm searches for the first prefix match in the configuration file.
> +Therefore, the order of the lines in the configuration file is important:
> +a wild-carded prefix at the beginning of the configuration file renders
> +all subsequent lines useless.
> +If there is no match, then opensm fails the query.
> +It is legal to repeat prefixes in the configuration file;
> +opensm will return the path to the first available matching router.
> +A configuration file with a single line where both prefix and GUID
> +are wild-carded means that a path record query specifying any
> +off-subnet DGID should return a path to the first available router.
> +This configuration yields the same behaviour formerly achieved by
> +compiling opensm with -DROUTER_EXP.
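
To illustrate the first-match lookup described in the quoted text, here is a minimal Python sketch. It is not opensm code: `parse_prefix_routes` and `find_router` are hypothetical helper names, and the sketch assumes that an entry whose router is not currently available is skipped so that a later line with the same prefix can match.

```python
# Illustrative sketch only; these helpers are hypothetical, not part of opensm.
WILDCARD = None  # stands for an asterisk in the configuration file


def _parse_field(token):
    """Return WILDCARD for '*', else the value parsed as hex (0x optional)."""
    if token == "*":
        return WILDCARD
    return int(token, 16)


def parse_prefix_routes(text):
    """Parse 'prefix guid' lines, ignoring blank lines and '#' comments."""
    routes = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        prefix, guid = line.split()
        routes.append((_parse_field(prefix), _parse_field(guid)))
    return routes


def find_router(routes, dgid_prefix, available_routers):
    """Return the router GUID for dgid_prefix, or None if the query fails."""
    for prefix, guid in routes:
        if prefix is not WILDCARD and prefix != dgid_prefix:
            continue
        if guid is WILDCARD:
            # Wild-carded GUID: any available router will do.
            if available_routers:
                return available_routers[0]
            continue
        if guid in available_routers:
            return guid
    return None  # no match: the query fails


conf = """
# primary and backup router for one off-subnet prefix
0xfec0000000000001 0x0002c90200000001
fec0000000000001   0002c90200000002   # backup
* *
"""
routes = parse_prefix_routes(conf)
```

Because the lookup stops at the first usable line, moving the `* *` line to the top would shadow both explicit entries.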
> +
> +.SH ROUTING
> +.PP
> +OpenSM now offers five routing engines:
> +
> +1.  Min Hop Algorithm - based on the minimum hops to each node where the
> +path length is optimized.
> +
> +2.  UPDN Unicast routing algorithm - also based on the minimum hops to each
> +node, but it is constrained to ranking rules. This algorithm should be chosen
> +if the subnet is not a pure Fat Tree, and deadlock may occur due to a
> +loop in the subnet.
> +
> +3.  Fat Tree Unicast routing algorithm - this algorithm optimizes routing
> +for congestion-free "shift" communication pattern.
> +It should be chosen if a subnet is a symmetrical Fat Tree of various types,
> +not just K-ary-N-Trees: non-constant K, not fully staffed, any CBB ratio.
> +Similar to UPDN, Fat Tree routing is constrained to ranking rules.
> +
> +4. LASH unicast routing algorithm - uses Infiniband virtual layers
> +(SL) to provide deadlock-free shortest-path routing while also
> +distributing the paths between layers. LASH is an alternative
> +deadlock-free topology-agnostic routing algorithm to the non-minimal
> +UPDN algorithm avoiding the use of a potentially congested root node.
> +
> +5. DOR Unicast routing algorithm - based on the Min Hop algorithm, but
> +avoids port equalization except for redundant links between the same
> +two switches.  This provides deadlock free routes for hypercubes when
> +the fabric is cabled as a hypercube and for meshes when cabled as a
> +mesh (see details below).
> +
> +OpenSM also supports a file method which
> +can load routes from a table. See \'Modular Routing Engine\' for more
> +information on this.
> +
> +The basic routing algorithm is comprised of two stages:
> +
> +1. MinHop matrix calculation
> +   How many hops are required to get from each port to each LID ?
> +   The algorithm to fill these tables is different if you run standard
> +(min hop) or Up/Down.
> +   For standard routing, a "relaxation" algorithm is used to propagate
> +min hop from every destination LID through neighbor switches.
> +   For Up/Down routing, a BFS from every target is used. The BFS tracks link
> +direction (up or down) and avoids steps that would go up after a down
> +step was used.
> +
> +2. Once MinHop matrices exist, each switch is visited and for each target LID a
> +decision is made as to what port should be used to get to that LID.
> +   This step is common to standard and Up/Down routing. Each port has a
> +counter counting the number of target LIDs going through it.
> +   When there are multiple alternative ports with the same MinHop to a LID,
> +the one with fewer previously assigned LIDs is selected.
> +   If LMC > 0, more checks are added: Within each group of LIDs assigned to
> +same target port,
> +   a. use only ports which have same MinHop
> +   b. first prefer the ones that go to a different systemImageGuid (than
> +the previous LID of the same LMC group)
> +   c. if none - prefer those which go through another NodeGuid
> +   d. fall back to the number of paths method (if all go to same node).
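
The two stages quoted above can be sketched in Python roughly as follows. This is a simplified illustration, not the opensm implementation: it uses a plain BFS rather than the "relaxation" algorithm for stage 1, ignores LMC, and the names `switch_distances`, `build_lfts`, `links` and `lid_home` are made up for the example.

```python
from collections import deque


def switch_distances(links, start):
    """BFS hop counts from `start` to every reachable switch."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        sw = queue.popleft()
        for neighbor in links[sw].values():
            if neighbor not in dist:
                dist[neighbor] = dist[sw] + 1
                queue.append(neighbor)
    return dist


def build_lfts(links, lid_home):
    """links: {switch: {port: neighbor switch}}; lid_home: {lid: switch}."""
    dist = {sw: switch_distances(links, sw) for sw in links}
    lfts = {}
    for sw, ports in links.items():
        load = {port: 0 for port in ports}  # target LIDs routed via each port
        lft = {}
        for lid in sorted(lid_home):
            home = lid_home[lid]
            if home == sw:
                continue  # LID hangs off this switch; not modelled here
            # Stage 1: hops to the LID through each port.
            hops = {p: 1 + dist[n][home]
                    for p, n in ports.items() if home in dist[n]}
            if not hops:
                continue  # unreachable
            best = min(hops.values())
            # Stage 2: among min-hop ports pick the least loaded one
            # (lowest port number breaks ties).
            port = min((p for p, h in hops.items() if h == best),
                       key=lambda p: (load[p], p))
            load[port] += 1
            lft[lid] = port
        lfts[sw] = lft
    return lfts


# Two switches joined by two parallel links; LIDs 10 and 11 sit on B.
links = {"A": {1: "B", 2: "B"}, "B": {1: "A", 2: "A"}}
lfts = build_lfts(links, {10: "B", 11: "B", 20: "A"})
```

In this small fabric switch A spreads LIDs 10 and 11 over its two equal-cost ports, which is the effect of the per-port counter.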
> +
> +Effect of Topology Changes
> +
> +OpenSM will preserve existing routing in any case where there is no change in
> +the fabric switches unless the -r (--reassign_lids) option is specified.
> +
> +-r
> +.br
> +--reassign_lids
> +          This option causes OpenSM to reassign LIDs to all
> +          end nodes. Specifying -r on a running subnet
> +          may disrupt subnet traffic.
> +          Without -r, OpenSM attempts to preserve existing
> +          LID assignments resolving multiple use of same LID.
> +
> +If a link is added or removed, OpenSM does not recalculate
> +the routes that do not have to change. A route has to change
> +if the port is no longer UP or no longer the MinHop. When routing changes
> +are performed, the same algorithm for balancing the routes is invoked.
> +
> +In the case of using the file based routing, any topology changes are
> +currently ignored. The 'file' routing engine just loads the LFTs from the file
> +specified, with no reaction to real topology. Obviously, this will not be able
> +to recheck LIDs (by GUID) for disconnected nodes, and LFTs for non-existent
> +switches will be skipped. Multicast is not affected by 'file' routing engine
> +(this uses min hop tables).
> +
> +
> +Min Hop Algorithm
> +
> +The Min Hop algorithm is invoked when neither UPDN nor the file method is
> +specified.
> +
> +The Min Hop algorithm is divided into two stages: computation of
> +min-hop tables on every switch and LFT output port assignment. Link
> +subscription is also equalized with the ability to override based on
> +port GUID. The latter is supplied by:
> +
> +-i <equalize-ignore-guids-file>
> +.br
> +--ignore-guids <equalize-ignore-guids-file>
> +          This option provides the means to define a set of ports
> +          (by guid) that will be ignored by the link load
> +          equalization algorithm. Note that only endports (CA,
> +          switch port 0, and router ports) and not switch external
> +          ports are supported.
> +
> +LMC awareness routes based on (remote) system or switch basis.
> +
> +
> +Purpose of UPDN Algorithm
> +
> +The UPDN algorithm is designed to prevent deadlocks from occurring in loops
> +of the subnet. A loop-deadlock is a situation in which it is no longer
> +possible to send data between any two hosts connected through the loop. As
> +such, the UPDN routing algorithm should be used if the subnet is not a pure
> +Fat Tree, and one of its loops may experience a deadlock (due, for example,
> +to high pressure).
> +
> +The UPDN algorithm is based on the following main stages:
> +
> +1.  Auto-detect root nodes - based on the CA hop length from any switch in
> +the subnet, a statistical histogram is built for each switch (hop num vs
> +number of occurrences). If the histogram reflects a specific column (higher
> +than others) for a certain node, then it is marked as a root node. Since
> +the algorithm is statistical, it may not find any root nodes. The list of
> +the root nodes found by this auto-detect stage is used by the ranking
> +process stage.
> +
> +    Note 1: The user can override the node list manually.
> +    Note 2: If this stage cannot find any root nodes, and the user did
> +            not specify a guid list file, OpenSM defaults back to the
> +            Min Hop routing algorithm.
> +
> +2.  Ranking process - All root switch nodes (found in stage 1) are assigned
> +a rank of 0. Using the BFS algorithm, the rest of the switch nodes in the
> +subnet are ranked incrementally. This ranking aids in the process of enforcing
> +rules that ensure loop-free paths.
> +
> +3.  Min Hop Table setting - after ranking is done, a BFS algorithm is run from
> +each (CA or switch) node in the subnet. During the BFS process, the FDB table
> +of each switch node traversed by BFS is updated, in reference to the starting
> +node, based on the ranking rules and guid values.
> +
> +At the end of the process, the updated FDB tables ensure loop-free paths
> +through the subnet.
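
A rough Python sketch of the ranking stage and of the "no up after down" rule from the quoted stages is given below. It is only an illustration under stated assumptions: `rank_switches` and `updn_hops` are hypothetical names, root auto-detection is not modelled, and switch names stand in for GUIDs when two neighbours have the same rank.

```python
from collections import deque


def rank_switches(adj, roots):
    """BFS ranking: roots get rank 0, every other switch its BFS depth."""
    rank = {r: 0 for r in roots}
    queue = deque(roots)
    while queue:
        sw = queue.popleft()
        for n in adj[sw]:
            if n not in rank:
                rank[n] = rank[sw] + 1
                queue.append(n)
    return rank


def updn_hops(adj, rank, src):
    """Min hops from src over paths that never go up after going down."""
    # BFS over (switch, gone_down) states.
    best = {(src, False): 0}
    queue = deque([(src, False)])
    while queue:
        sw, gone_down = queue.popleft()
        for n in adj[sw]:
            # Assumption: ties in rank are broken by name (GUID stand-in).
            going_up = (rank[n], n) < (rank[sw], sw)
            if going_up and gone_down:
                continue  # would break the Up/Down rule
            state = (n, gone_down or not going_up)
            if state not in best:
                best[state] = best[(sw, gone_down)] + 1
                queue.append(state)
    hops = {}
    for (sw, _), h in best.items():
        hops[sw] = min(h, hops.get(sw, h))
    return hops


adj = {
    "R": ["A", "M"],
    "A": ["R", "C"],
    "M": ["R", "B"],
    "B": ["M", "C"],
    "C": ["A", "B"],
}
rank = rank_switches(adj, ["R"])
hops = updn_hops(adj, rank, "A")
```

Here the unrestricted shortest path from A to B is A-C-B (two hops), but it goes down and then up, so the Up/Down-legal distance is three hops via R and M.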
> +
> +Note: Up/Down routing does not allow LID routing communication between
> +switches that are located inside spine "switch systems".
> +The reason is that there is no way to allow a LID route between them
> +that does not break the Up/Down rule.
> +One ramification of this is that you cannot run SM on switches other
> +than the leaf switches of the fabric.
> +
> +
> +UPDN Algorithm Usage
> +
> +Activation through OpenSM
> +
> +Use '-R updn' option (instead of old '-u') to activate the UPDN algorithm.
> +Use '-a <root_guid_file>' for adding an UPDN guid file that contains the
> +root nodes for ranking.
> +If the `-a' option is not used, OpenSM uses its auto-detect root nodes
> +algorithm.
> +
> +Notes on the guid list file:
> +
> +1.   A valid guid file specifies one guid in each line. Lines with an invalid
> +format will be discarded.
> +.br
> +2.   The user should specify the root switch guids. However, it is also
> +possible to specify CA guids; OpenSM will use the guid of the switch (if
> +it exists) that connects the CA to the subnet as a root node.
> +
> +
> +Fat-tree Routing Algorithm
> +
> +The fat-tree algorithm optimizes routing for "shift" communication pattern.
> +It should be chosen if a subnet is a symmetrical or almost symmetrical
> +fat-tree of various types.
> +It supports more than just K-ary-N-Trees: it handles non-constant K,
> +cases where not all leaves (CAs) are present, and any CBB ratio.
> +As in UPDN, fat-tree also prevents credit-loop-deadlocks.
> +
> +If the root guid file is not provided ('-a' or '--root_guid_file' options),
> +the topology has to be pure fat-tree that complies with the following rules:
> +  - Tree rank should be between two and eight (inclusive)
> +  - Switches of the same rank should have the same number
> +    of UP-going port groups*, unless they are root switches,
> +    in which case they shouldn't have UP-going ports at all.
> +  - Switches of the same rank should have the same number
> +    of DOWN-going port groups, unless they are leaf switches.
> +  - Switches of the same rank should have the same number
> +    of ports in each UP-going port group.
> +  - Switches of the same rank should have the same number
> +    of ports in each DOWN-going port group.
> +  - All the CAs have to be at the same tree level (rank).
> +
> +If the root guid file is provided, the topology doesn't have to be pure
> +fat-tree, and it should only comply with the following rules:
> +  - Tree rank should be between two and eight (inclusive)
> +  - All the Compute Nodes** have to be at the same tree level (rank).
> +    Note that non-compute node CAs are allowed here to be at different
> +    tree ranks.
> +
> +* ports that are connected to the same remote switch are referenced as
> +\'port group\'.
> +
> +** list of compute nodes (CNs) can be specified by \'-u\' or \'--cn_guid_file\'
> +OpenSM options.
> +
> +Topologies that do not comply cause a fallback to min hop routing.
> +Note that this can also occur on link failures which cause the topology
> +to no longer be "pure" fat-tree.
> +
> +Note that although fat-tree algorithm supports trees with non-integer CBB
> +ratio, the routing will not be as balanced as in case of integer CBB ratio.
> +In addition to this, although the algorithm allows leaf switches to have any
> +number of CAs, the closer the tree is to being fully populated, the more
> +effective the "shift" communication pattern will be.
> +In general, even if the root list is provided, the closer the topology to a
> +pure and symmetrical fat-tree, the more optimal the routing will be.
> +
> +The algorithm also dumps a compute node ordering file
> +(opensm-ftree-ca-order.dump) in the same directory where the OpenSM log
> +resides. This ordering file provides the CN order that may be used to create
> +an efficient communication pattern that matches the routing tables.
> +
> +Activation through OpenSM
> +
> +Use '-R ftree' option to activate the fat-tree algorithm.
> +Use '-a <root_guid_file>' to provide root nodes for ranking. If the `-a' option
> +is not used, routing algorithm will detect roots automatically.
> +Use '-u <root_cn_file>' to provide the list of compute nodes. If the `-u' option
> +is not used, all the CAs are considered as compute nodes.
> +
> +Note: LMC > 0 is not supported by fat-tree routing. If this is
> +specified, the default routing algorithm is invoked instead.
> +
> +
> +LASH Routing Algorithm
> +
> +LASH is an acronym for LAyered SHortest Path Routing. It is a
> +deterministic shortest path routing algorithm that enables topology
> +agnostic deadlock-free routing within communication networks.
> +
> +When computing the routing function, LASH analyzes the network
> +topology for the shortest-path routes between all pairs of sources /
> +destinations and groups these paths into virtual layers in such a way
> +as to avoid deadlock.
> +
> +Note that LASH analyzes routes and ensures deadlock freedom between switch
> +pairs. The link between an HCA and a switch does not need virtual
> +layers, as deadlock will not arise between switch and HCA.
> +
> +In more detail, the algorithm works as follows:
> +
> +1) LASH determines the shortest-path between all pairs of source /
> +destination switches. Note, LASH ensures the same SL is used for all
> +SRC/DST - DST/SRC pairs and there is no guarantee that the return
> +path for a given DST/SRC will be the reverse of the route SRC/DST.
> +
> +2) LASH then begins an SL assignment process where a route is assigned
> +to a layer (SL) if the addition of that route does not cause deadlock
> +within that layer. This is achieved by maintaining and analysing a
> +channel dependency graph for each layer. Once the potential addition
> +of a path could lead to deadlock, LASH opens a new layer and continues
> +the process.
> +
> +3) Once this stage has been completed, it is highly likely that the
> +first layers processed will contain more paths than the latter ones.
> +To better balance the use of layers, LASH moves paths from one layer
> +to another so that the number of paths in each layer averages out.
> +
> +Note, the implementation of LASH in opensm attempts to use as few layers
> +as possible. This number can be less than the number of actual layers
> +available.
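
The SL assignment in step 2 of the quoted text can be illustrated with a small Python sketch. It assumes the routes are already computed, treats a layer as deadlock-free when its channel dependency graph has no cycle, and does not model the balancing of step 3 or the shared SL for SRC/DST and DST/SRC pairs. All function names are hypothetical.

```python
def has_cycle(deps):
    """DFS cycle check on a dependency graph {channel: set of next channels}."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}

    def visit(ch):
        color[ch] = GREY
        for nxt in deps.get(ch, ()):
            state = color.get(nxt, WHITE)
            if state == GREY or (state == WHITE and visit(nxt)):
                return True
        color[ch] = BLACK
        return False

    return any(color.get(ch, WHITE) == WHITE and visit(ch)
               for ch in list(deps))


def route_dependencies(route):
    """Pairs of consecutive channels (directed links) used by a route."""
    channels = list(zip(route, route[1:]))
    return list(zip(channels, channels[1:]))


def assign_layers(routes):
    """Put each route in the first layer whose graph stays acyclic."""
    layers = []       # one channel dependency graph per layer
    assignment = []   # layer index chosen for each route
    for route in routes:
        new_deps = route_dependencies(route)
        for index, deps in enumerate(layers):
            trial = {ch: set(nxt) for ch, nxt in deps.items()}
            for a, b in new_deps:
                trial.setdefault(a, set()).add(b)
            if not has_cycle(trial):
                layers[index] = trial
                assignment.append(index)
                break
        else:
            # Every existing layer would deadlock: open a new one.
            fresh = {}
            for a, b in new_deps:
                fresh.setdefault(a, set()).add(b)
            layers.append(fresh)
            assignment.append(len(layers) - 1)
    return assignment


# Four clockwise two-hop routes on a ring of switches 0-1-2-3-0.
ring_routes = [[0, 1, 2], [1, 2, 3], [2, 3, 0], [3, 0, 1]]
layers_used = assign_layers(ring_routes)
```

On the ring the fourth route would close a dependency cycle in the first layer, so the sketch places it in a second layer.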
> +
> +In general LASH is a very flexible algorithm. It can, for example,
> +reduce to Dimension Order Routing in certain topologies, it is topology
> +agnostic and fares well in the face of faults.
> +
> +It has been shown that for both regular and irregular topologies, LASH
> +outperforms Up/Down. The reason for this is that LASH distributes the
> +traffic more evenly through a network, avoiding the bottleneck issues
> +related to a root node and always routes shortest-path.
> +
> +The algorithm was developed by Simula Research Laboratory.
> +
> +
> +Use the '-R lash -Q' options to activate the LASH algorithm.
> +
> +Note: QoS support has to be turned on in order that SL/VL mappings are
> +used.
> +
> +Note: LMC > 0 is not supported by the LASH routing. If this is
> +specified, the default routing algorithm is invoked instead.
> +
> +
> +DOR Routing Algorithm
> +
> +The Dimension Order Routing algorithm is based on the Min Hop
> +algorithm and so uses shortest paths.  Instead of spreading traffic
> +out across different paths with the same shortest distance, it chooses
> +among the available shortest paths based on an ordering of dimensions.
> +Each port must be consistently cabled to represent a hypercube
> +dimension or a mesh dimension.  Paths are grown from a destination
> +back to a source using the lowest dimension (port) of available paths
> +at each step.  This provides the ordering necessary to avoid deadlock.
> +When there are multiple links between any two switches, they still
> +represent only one dimension and traffic is balanced across them
> +unless port equalization is turned off.  In the case of hypercubes,
> +the same port must be used throughout the fabric to represent the
> +hypercube dimension and match on both ends of the cable.  In the case
> +of meshes, the dimension should consistently use the same pair of
> +ports, one port on one end of the cable, and the other port on the
> +other end, continuing along the mesh dimension.
> +
> +Use '-R dor' option to activate the DOR algorithm.
> +
> +
> +Routing References
> +
> +To learn more about deadlock-free routing, see the article
> +"Deadlock Free Message Routing in Multiprocessor Interconnection Networks"
> +by William J Dally and Charles L Seitz (1985).
> +
> +To learn more about the up/down algorithm, see the article
> +"Effective Strategy to Compute Forwarding Tables for InfiniBand Networks"
> +by Jose Carlos Sancho, Antonio Robles, and Jose Duato at the
> +Universidad Politecnica de Valencia.
> +
> +To learn more about LASH and the flexibility behind it, the requirement
> +for layers, performance comparisons to other algorithms, see the
> +following articles:
> +
> +"Layered Routing in Irregular Networks", Lysne et al., IEEE
> +Transactions on Parallel and Distributed Systems, Vol.16, No.12,
> +December 2005.
> +
> +"Routing for the ASI Fabric Manager", Solheim et al. IEEE
> +Communications Magazine, Vol.44, No.7, July 2006.
> +
> +"Layered Shortest Path (LASH) Routing in Irregular System Area
> +Networks", Skeie et al. IEEE Computer Society Communication
> +Architecture for Clusters 2002.
> +
> +
> +Modular Routing Engine
> +
> +Modular routing engine structure allows for the ease of
> +"plugging" new routing modules.
> +
> +Currently, only unicast callbacks are supported. Multicast
> +can be added later.
> +
> +One existing routing module is up-down "updn", which may be
> +activated with '-R updn' option (instead of old '-u').
> +
> +General usage is:
> +$ opensm -R 'module-name'
> +
> +There is also a trivial routing module which is able
> +to load LFT tables from a dump file.
> +
> +Main features:
> +
> + - this will load switch LFTs and/or LID matrices (min hops tables)
> + - this will load switch LFTs according to the path entries introduced
> +   in the dump file
> + - no additional checks will be performed (such as "is port connected",
> +   etc.)
> + - in case when fabric LIDs were changed this will try to reconstruct
> +   LFTs correctly if endport GUIDs are represented in the dump file
> +   (in order to disable this, GUIDs may be removed from the dump file
> +    or zeroed)
> +
> +The dump file format is compatible with the output of the 'ibroute' utility,
> +and a dump for the whole fabric can be generated with the dump_lfts.sh script.
> +
> +To activate file based routing module, use:
> +
> +  opensm -R file -U /path/to/dump_file
> +
> +If the dump_file is not found or is in error, the default routing
> +algorithm is utilized.
> +
> +The ability to dump switch lid matrices (aka min hops tables) to file and
> +later to load these is also supported.
> +
> +The usage is similar to unicast forwarding tables loading from dump
> +file (introduced by 'file' routing engine), but new lid matrix file
> +name should be specified by -M or --lid_matrix_file option. For example:
> +
> +  opensm -R file -M ./opensm-lid-matrix.dump
> +
> +The dump file is named \'opensm-lid-matrix.dump\' and will be generated
> +in standard opensm dump directory (/var/log by default) when
> +OSM_LOG_ROUTING logging flag is set.
> +
> +When the 'file' routing engine is activated but the dump file is not specified
> +or cannot be opened, the default lid matrix algorithm will be used.
> +
> +There is also a switch forwarding tables dumper which generates
> +a file compatible with dump_lfts.sh output. This file can be used
> +as input for forwarding tables loading by 'file' routing engine.
> +Either or both of the options -U and -M can be specified together with \'-R file\'.
> +
> +.SH FILES
> +.TP
> +.B @CONF_DIR@/prefix-routes.conf
> +default prefix routes file.
> +
> +.SH AUTHORS
> +.TP
> +Hal Rosenstock
> +.RI < hal at xsigo.com >
> +.TP
> +Sasha Khapyorsky
> +.RI < sashak at voltaire.com >
> +.TP
> +Eitan Zahavi
> +.RI < eitan at mellanox.co.il >
> +.TP
> +Yevgeny Kliteynik
> +.RI < kliteyn at mellanox.co.il >
> +.TP
> +Thomas Sodring
> +.RI < tsodring at simula.no >


