[ofa-general] QoS RFC

Hal Rosenstock hal.rosenstock at gmail.com
Tue Aug 7 10:56:44 PDT 2007


Hi Yevgeny,

On 8/2/07, Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il> wrote:
> Hi Hal,
>
> Hal Rosenstock wrote:
> > Hi Yevgeny,
> >
> > On 7/21/07, Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il> wrote:
> >> Hi All
> >>
> >> Please find the attached RFC describing how QoS policy support could be implemented in the OpenFabrics stack.
> >> Your comments are welcome.
> >
> > A couple of quick questions:
> >
> > How does this differ from the original RFC posted 5/30/06 ?
> >
> > What I can see is the following:
> > 1. Updated for not yet released IBTA QoS Annex
> > 2. Use of plain text rather than XML based policy file for OpenSM
> > Anything else ?
>
> You're absolutely right - these are the only changes (plus cosmetics here and there).
>
> > Below, IPoIB is discussed in terms of UD. What about IPoIB-CM ? It
> > uses CM and has a service ID.

Will IPoIB CM be added to the RFC document or is it the same as UD ?

> >
> > Also, have my specific comments to the patches originally submitted
> > been addressed ? (Do I need to dig them out again ?) Just wondering...
>
> Yes. The submitted patches were only QoS policy file parser.
> In the new parser I took care of all the issues we've discussed
> a couple of months ago. Here is the summary of these issues
> taken from our discussion:
>
> [snip]
>        XML syntax or not -> Plain text, human readable and easily editable
>
>        QoS syntax explanation/discussion
>
>                changes to some keywords
>                        Port r.t. Node Groups  -> DONE: fixed keywords
>                        CA r.t. HCA            -> DONE: fixed keywords
>                        QoSClass r.t. TClass   -> DONE: fixed keywords
>                syntax discussion points
>                    larger ones:
>                        dynamic service IDs -> supported through list and range support
>                        service ID range support -> DONE: added to the matching rules examples
>                        port groups shared with partition configuration (future) -> agree, it would be a good
>                                              idea to share port groups with partition configuration,
>                                              but it won't be for OFED 1.3
>                        multicast -> not planned for OFED 1.3, but we'll discuss it later
>                    smaller ones:
>                        across syntax explanation -> DONE: see the explanation and an example in the policy file
>                        What is sn in the syntax short for ? -> it was for "serial number", replaced by
>                                              "qos-level-sn". It means the serial (sequential) number
>                                              of the qos-level that should be applied to PathRecords
>                                              that match this qos-match-rule. I will probably change
>                                              "qos-level-sn" to "qos-level-name" to refer to the QoS level
>                                              by name rather than by sn.
>                        path bits explanation     -> Path bits are part of QoS level.
>                                              They can be used to "differentiate" paths through the subnet
>                                              to a port when LMC>0. It won't be implemented yet for
>                                              OFED 1.3, and OpenSM should issue a warning if it finds
>                                              PathBits in QoS level definition in the policy file
>                        packet lifetime ?         -> DONE: added packet-life keyword
>
>
>            viewer/editor -> Since we've switched to plain text, this one becomes irrelevant

Thanks.

-- Hal

> [/snip]
>
>
> -- Yevgeny
>
>
>
> > Thanks.
> >
> > -- Hal
> >
> >
> >
> >
> >> -- Yevgeny
> >>
> >>               RFC: OpenFabrics Enhancements for QoS Support
> >>              ===============================================
> >>
> >> Authors: . Eitan Zahavi <eitan at mellanox.co.il>
> >> Authors: . Yevgeny Kliteynik <kliteyn at mellanox.co.il>
> >> Date: .... Jul 2007.
> >> Revision:  0.2
> >>
> >> Table of contents:
> >> 1. Overview
> >> 2. Architecture
> >> 3. Supported Policy
> >> 4. IPoIB functionality
> >> 5. CMA functionality
> >> 6. SDP functionality
> >> 7. SRP functionality
> >> 8. iSER functionality
> >> 9. OpenSM functionality
> >>
> >> 1. Overview
> >> ------------
> >> Quality of Service requirements stem from the realization of I/O consolidation
> >> over an IB network: as multiple applications and ULPs share the same fabric, a means
> >> to control their use of the network resources is becoming a must. The basic
> >> need is to differentiate the service levels provided to different traffic flows,
> >> so that a policy can be enforced to control each flow's utilization of the
> >> fabric resources.
> >>
> >> The IBTA specification defines several hardware features and management interfaces
> >> to support QoS:
> >> * Up to 15 Virtual Lanes (VLs) carry traffic in a non-blocking manner
> >> * Arbitration between traffic of different VLs is performed by a weighted
> >>   round robin arbiter with two priority levels. The arbiter is programmable with
> >>   a sequence of (VL, weight) pairs and a maximal number of high priority credits
> >>   to be processed before low priority is served
> >> * Packets carry a class of service marking in the range 0 to 15 in their
> >>   header SL field
> >> * Each switch can map an incoming packet by its SL to a particular output
> >>   VL based on a programmable table VL=SL-to-VL-MAP(in-port, out-port, SL)
> >> * The Subnet Administrator controls the parameters of each communication flow
> >>   by providing them in response to Path Record (PR) or MultiPathRecord (MPR)
> >>   queries
> >>
> >> The IB QoS features provide the means to implement a DiffServ-like architecture.
> >> The DiffServ architecture (IETF RFC 2474 and RFC 2475) is widely used today in
> >> highly dynamic fabrics.
> >>
> >> This proposal provides the detailed functional definition for the various
> >> software elements that are required to enable a DiffServ like architecture over
> >> the OpenFabrics software stack.
> >>
> >>
> >>
> >> 2. Architecture
> >> ----------------
> >> This proposal splits the QoS functionality between the SM/SA, CMA and the various
> >> ULPs. We take the "chronology approach" to describe how the overall system
> >> works:
> >>
> >> 2.1. The network manager (human) provides a set of rules (policy) that defines
> >> how the network is configured and how its resources are split among different
> >> QoS-Levels. The policy also defines how to decide which QoS-Level each
> >> application, ULP or service uses.
> >>
> >> 2.2. The SM analyzes the provided policy to see if it is realizable and performs
> >> the necessary fabric setup. The SM may continuously monitor the policy and adapt
> >> to changes in it. Part of this policy defines the default QoS-Level of each
> >> partition. The SA is being enhanced to match the requested Source, Destination,
> >> QoS-Class, Service-ID (and optionally SL and priority) against the policy, so
> >> that clients (ULPs, programs) can obtain policy-enforced QoS. The SM is also
> >> enhanced to support setting up partitions with an appropriate IPoIB broadcast
> >> group. This broadcast group carries its QoS attributes: SL, MTU and
> >> RATE.
> >>
> >> 2.3. IPoIB is set up. IPoIB uses the SL, MTU and RATE available on the
> >> multicast group which forms the broadcast group of this partition.
> >>
> >> 2.4. MPI which provides non IB based connection management should be configured
> >> to run using hard coded SLs. It uses these SLs for every QP being opened.
> >>
> >> 2.5. ULPs that use CM interface (like SRP) should have their own pre-assigned
> >> Service-ID and use it while obtaining PR/MPR for establishing connections.
> >> The SA receiving the PR/MPR should match it against the policy and return
> >> the appropriate PR/MPR including SL, MTU and RATE.
> >>
> >> 2.6. ULPs and programs using CMA to establish an RC connection should provide the
> >> CMA with the target IP and Service-ID. Some of the ULPs might also provide a
> >> QoS-Class (e.g. for SDP sockets that are provided the TOS socket option). The CMA
> >> should then use the provided Service-ID and optional QoS-Class and pass them in the
> >> PR/MPR request. The resulting PR/MPR should be used for configuring the
> >> connection QP.
> >>
> >> PathRecord and MultiPathRecord enhancement for QoS:
> >> As mentioned above, the PathRecord and MultiPathRecord attributes should be
> >> enhanced to carry the Service-ID, which is a 64-bit value and has been
> >> standardized by the IBTA. A new field, QoS-Class, is also provided.
> >> A new capability bit should describe the SM QoS support in the SA class port
> >> info. This approach provides an easy migration path for the existing access layer
> >> and ULPs by not introducing a new set of PR/MPR attributes.
> >>
> >>
> >> 3. Supported Policy
> >> --------------------
> >>
> >> The QoS policy supported by this proposal is divided into 4 sub sections:
> >>
> >> I) Port Group: a set of CAs, Routers or Switches that share the same settings.
> >> A port group might be a partition defined by the partition manager policy in
> >> terms of GUIDs. Future implementations might provide support for NodeDescription
> >> based definition of port groups.
> >>
> >> II) Fabric Setup:
> >> Defines how the SL2VL and VLArb tables should be set up. This policy definition
> >> assumes that the computation of overall end-to-end network behavior is performed
> >> outside of OpenSM.
> >>
> >> III) QoS-Levels Definition:
> >> This section defines the possible sets of parameters for QoS that a client
> >> might be mapped to. Each set holds: SL and optionally: Max MTU, Max Rate,
> >> Packet Lifetime and Path Bits (in case LMC > 0 is used for QoS).
> >>
> >> IV) Matching Rules:
> >> A list of rules that match an incoming PR/MPR request to a QoS-Level. The
> >> rules are processed in order, so that the first match is applied. Each rule is
> >> built out of a set of match expressions which should all match for the rule to
> >> apply. The matching expressions are defined for the following fields:
> >> ** SRC and DST to lists of port groups
> >> ** Service-ID to a list of Service-ID or Service-ID ranges
> >> ** QoS-Class to a list of QoS-Class values or ranges
> >>
> >> QoS Policy file syntax
> >>
> >> * Empty lines are ignored
> >> * Leading and trailing blanks are ignored, so the
> >>   indentation in the example is just for better readability
> >> * Comments start with the pound sign (#) and are terminated by EOL
> >> * Comments may appear only on a separate line
> >> * Keywords that denote section/subsection start have matching closing keywords
> >> * Any keyword should be the first non-blank token on its line
> >>
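The syntax rules above can be sketched as a small line reader. This is only a minimal illustration, not the OpenSM parser: the function names iter_policy_lines and check_nesting are hypothetical, and it assumes that every section keyword "foo" is closed by "end-foo", as in the example policy file.

```python
def iter_policy_lines(text):
    """Yield (keyword, value) pairs from QoS policy text.

    value is None for section keywords such as "port-groups".
    (Hypothetical helper, not part of OpenSM.)
    """
    for raw in text.splitlines():
        line = raw.strip()                 # leading/trailing blanks are ignored
        if not line or line.startswith("#"):
            continue                       # skip empty lines and comment lines
        # Split on the first ':' only, so values like "0:255,1:127" stay intact.
        keyword, sep, value = line.partition(":")
        yield keyword.strip(), (value.strip() if sep else None)


def check_nesting(text):
    """Return True if every section keyword has its matching "end-" keyword."""
    stack = []
    for keyword, value in iter_policy_lines(text):
        if value is not None:
            continue                       # "key: value" lines do not open sections
        if keyword.startswith("end-"):
            if not stack or stack.pop() != keyword[len("end-"):]:
                return False
        else:
            stack.append(keyword)
    return not stack
```

Splitting on the first colon keeps VLArb entries such as "vlarb-high: 0:255,1:127" in one value string for a later stage to interpret.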
> >> QoS Policy file example
> >>
> >>     # Port Groups define sets of ports to be used later in the settings
> >>     port-groups
> >>         # using port GUIDs
> >>         port-group
> >>             name: Storage
> >>             # "use" is just a description that is used for logging.
> >>             #  Other than that, it is just a commentary
> >>             use: our SRP storage targets
> >>             port-guid: 0x1000000000000001
> >>             port-guid: 0x1000000000000002
> >>         end-port-group
> >>
> >>         port-group
> >>             name: Virtual Servers
> >>             use: node desc and IB port num
> >>             # The syntax of the port name is as follows: "hostname/CA-num/Pnum".
> >>             # "hostname" and "CA-num" are compared to the first 2 words of
> >>             # NodeDescription, and "Pnum" is a port number on that node.
> >>             port-name: vs1/HCA-1/P1
> >>             port-name: vs3/HCA-1/P1
> >>             port-name: vs3/HCA-2/P2
> >>         end-port-group
> >>
> >>         # using partitions defined in the partition policy
> >>         port-group
> >>             name: Group for Partition 1
> >>             use: default settings
> >>             partition: Part1
> >>         end-port-group
> >>
> >>         # using node types CA|ROUTER|SWITCH
> >>         port-group
> >>             name: Routers
> >>             use: all routers
> >>             node-type: ROUTER
> >>         end-port-group
> >>
> >>     end-port-groups
> >>
> >>     qos-setup
> >>
> >>         # define all types of VLArb tables. The length of the tables should
> >>         # match the physically supported tables by their target ports
> >>         vlarb-tables
> >>             # scope defines the exact ports the VLArb tables apply to
> >>             vlarb-scope
> >>                 # defining VLArb tables on all the ports that belong to
> >>                 # port group 'Storage', and on all the ports connected
> >>                 # to ports of port group 'Storage'
> >>                 group: Storage
> >>                 # "across" means all the ports that are connected to ports
> >>                 # that belong to the specified port group
> >>                 across: Storage
> >>                 # VLArb table holds VL and weight pairs
> >>                 vlarb-high: 0:255,1:127,2:63,3:31,4:15,5:7,6:3,7:1
> >>                 vlarb-low: 8:255,9:127,10:63,11:31,12:15,13:7,14:3
> >>                 vl-high-limit: 10
> >>             end-vlarb-scope
> >>             # There can be several scopes
> >>         end-vlarb-tables
> >>
> >>         sl2vl-tables
> >>             # Scope defines the exact devices and in/out ports tables apply to.
> >>             # Note: if the same port is matching several rules the *FIRST* one applies.
> >>             sl2vl-scope
> >>                 # SL2VL tables are organized as SL2VL(in-port,out-port)
> >>                 # "from: n,m" means we define the SL2VL(n,*) and SL2VL(m,*)
> >>                 # "to: n,m" means we define the SL2VL(*,n) and SL2VL(*,m)
> >>                 #
> >>                 # The following example specifies that all the SL2VL tables
> >>                 # entries should be defined for all the ports of group Part1:
> >>                 group: Part1
> >>                 from: *
> >>                 to: *
> >>                 # SL2VL table has to have 16 values at max - one for each SL.
> >>                 # If the user specifies less than 16 values, all the missing
> >>                 # VL values will be implicitly set to 0
> >>                 sl2vl-table: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
> >>             end-sl2vl-scope
> >>
> >>             sl2vl-scope
> >>                 # "across-to" is a combination of "across" keyword (definition can be found
> >>                 # in VLArb tables section) and "to" keyword.
> >>                 # "across: PortGroupName" refers to all the ports that are connected
> >>                 # to ports that belong to PortGroupName.
> >>                 #
> >>                 # Example of "across-to" usage:
> >>                 #   A user has a set of 'special' nodes (e.g. storage nodes), and all
> >>                 #   the traffic to these nodes has to get specific VL.
> >>                 #   The solution is to define a port group (e.g. "Storage") that will
> >>                 #   include all the ports of these nodes, and then to configure SL2VL
> >>                 #   tables on all the switch ports that are connected to the Storage
> >>                 #   port group by specifying "across-to: Storage".
> >>                 #
> >>                 across-to: Storage2
> >>                 # Similar to "across-to", "across-from" is a combination of "across"
> >>                 # and "from" keywords
> >>                 across-from: Storage1
> >>                 sl2vl-table: 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0
> >>             end-sl2vl-scope
> >>         end-sl2vl-tables
> >>
> >>     end-qos-setup
> >>
> >>
> >>     qos-levels
> >>
> >>         # the first one is just setting SL
> >>         qos-level
> >>             use: for the lowest priority communication
> >>             sl: 15
> >>             packet-life: 16
> >>         end-qos-level
> >>         # the second sets SL and QoS Class
> >>         qos-level
> >>             use: low latency best bandwidth
> >>             sl: 0
> >>         end-qos-level
> >>         # the whole set: SL, MTU-Limit, Rate-Limit, Packet Lifetime, Path Bits
> >>         qos-level
> >>             use: just an example
> >>             sl: 0
> >>             mtu-limit: 1
> >>             rate-limit: 1
> >>             packet-life: 12
> >>             # Path Bits can be used e.g. to provide different routes through the
> >>             # subnet to a particular port
> >>             path-bits: 2,4,8-32
> >>         end-qos-level
> >>
> >>     end-qos-levels
> >>
> >>
> >>     # Match rules are scanned in a first-fit manner (like firewall rules table)
> >>     qos-match-rules
> >>
> >>         # matching by single criteria: class (list of values and ranges)
> >>         qos-match-rule
> >>             # just a description
> >>             use: low latency by class 7-9 or 11
> >>             qos-class: 7-9,11
> >>             # number of qos-level to apply to the matching PR/MPR
> >>             qos-level-sn: 1
> >>         end-qos-match-rule
> >>         # show matching by destination group AND service-ids
> >>         qos-match-rule
> >>             use: Storage targets connection
> >>             destination: Storage
> >>             service-id: 22,4719-5000
> >>             qos-level-sn: 2
> >>         end-qos-match-rule
> >>         # show matching by source group only
> >>         qos-match-rule
> >>             use: bla bla
> >>             source: Storage
> >>             qos-level-sn: 3
> >>         end-qos-match-rule
> >>
> >>     end-qos-match-rules
> >>
> >>
> >> 4. IPoIB
> >> ---------
> >>
> >> IPoIB already queries the SA for its broadcast group information. The additional
> >> functionality required is for IPoIB to provide the broadcast group SL, MTU,
> >> and RATE in every subsequent PathRecord query performed when a new UDAV is
> >> needed by IPoIB.
> >> We could assign a special Service-ID for IPoIB use, but since all communication
> >> on the same IPoIB interface shares the same QoS-Level without the ability to
> >> differentiate it by target service, we can ignore it for simplicity.
> >>
> >> 5. CMA features
> >> ----------------
> >>
> >> The CMA interface supports Service-ID through the notion of port space as a
> >> prefix to the port_num which is part of the sockaddr provided to
> >> rdma_resolve_addr(). What is missing is an explicit request for a QoS-Class that
> >> would allow the ULP (like SDP) to propagate a specific request for a class of
> >> service. A mechanism for providing the QoS-Class is available in the IPv6 address,
> >> so we could use that address field. Another option is to implement a special
> >> connection options API for CMA.
> >>
> >> The functionality missing from CMA is the use of the provided QoS-Class and
> >> Service-ID in the PR/MPR it sends. When a response is obtained, it is an existing
> >> requirement for the CMA to use the PR/MPR from the response in setting up the QP
> >> address vector.
> >>
> >>
> >> 6. SDP
> >> -------
> >>
> >> SDP uses CMA for building its connections.
> >> The Service-ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits
> >> holding the remote TCP/IP Port Number to connect to.
> >> SDP might be provided with the SO_PRIORITY socket option. In that case the value
> >> provided should be sent to the CMA as the TClass option of that connection.
> >>
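The SDP Service-ID layout described above (0x000000000001PPPP) can be illustrated with a few lines. This is only a sketch; the names SDP_SERVICE_ID_BASE and sdp_service_id are made up for the example and are not an existing API.

```python
# Hypothetical names, for illustration only.
SDP_SERVICE_ID_BASE = 0x0000000000010000  # the fixed 0x000000000001 prefix


def sdp_service_id(tcp_port):
    """Build the 64-bit SDP Service-ID for a remote TCP/IP port number."""
    if not 0 <= tcp_port <= 0xFFFF:
        raise ValueError("TCP port must fit in 4 hex digits")
    return SDP_SERVICE_ID_BASE | tcp_port


if __name__ == "__main__":
    # Show the Service-ID as 16 hex digits for the port 5001 (chosen arbitrarily).
    print(format(sdp_service_id(5001), "#018x"))
```

A matching rule for SDP traffic to a given port range could then list the corresponding Service-ID range in its service-id field.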
> >> 7. SRP
> >> -------
> >>
> >> The current SRP implementation uses its own CM callbacks (not CMA). So SRP should
> >> fill in the Service-ID in the PR/MPR by itself and use that information in
> >> setting up the QP. The T10 SRP standard leaves the SRP Service-ID to be defined
> >> by the SRP target I/O Controller (but it should also comply with IBTA Service-ID
> >> rules). In any case, the Service-ID is reported by the I/O Controller in the
> >> ServiceEntries DM attribute and should be used in the PR/MPR if the SA
> >> reports its ability to handle QoS PR/MPRs.
> >>
> >> 8. iSER
> >> --------
> >> iSER uses CMA and thus should be very close to SDP. The Service-ID for iSER
> >> is TBD.
> >>
> >>
> >> 9. OpenSM features
> >> -------------------
> >> The QoS related functionality to be provided by OpenSM can be split into two
> >> main parts:
> >>
> >> 9.1. Fabric Setup
> >> During fabric initialization the SM should parse the policy and apply its
> >> settings to the discovered fabric elements. The following actions should be
> >> performed:
> >> * Parsing of policy
> >> * Node Group identification. Warning should be provided for each node not
> >>   specified but found.
> >> * SL2VL settings validation should be checked:
> >>   + A warning will be provided if there are no matching targets for the SL2VL
> >>     setting statement.
> >>   + An error message will be printed to the log file if an invalid setting is
> >>     found. A setting is invalid if it refers to:
> >>     - Non existing port numbers of the target devices
> >>     - Unsupported VLs for the target device. In the latter case the mapping to non
> >>       existing VLs should be replaced with VL15, i.e. packets will be dropped.
> >> * SL2VL setting is to be performed
> >> * VL Arbitration table settings should be validated according to the following
> >>   rules:
> >>   + A warning will be provided if there are no matching targets for the setting
> >>     statement
> >>   + An error will be provided if the port number exceeds the target ports
> >>   + An error will be generated if the table length exceeds device capabilities
> >>   + A warning will be generated if the table references a VL that is not
> >>     supported by the target device
> >> * VL Arbitration tables will be set on the appropriate targets
> >>
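The SL2VL validation step above, together with the padding rule from the policy file comments (fewer than 16 values means the missing VLs are 0), could look roughly like this. It is a sketch under assumptions: normalize_sl2vl and its data_vls parameter (the number of data VLs the target port supports) are hypothetical names, and a real SM would take that number from the port's capabilities.

```python
def normalize_sl2vl(table, data_vls):
    """Return (sl2vl, errors) for one SL2VL table from the policy file.

    table    -- list of VL numbers, one per SL, at most 16 entries
    data_vls -- number of data VLs supported by the target port (assumed input)
    """
    if len(table) > 16:
        raise ValueError("SL2VL table holds at most 16 values, one per SL")
    # Missing entries are implicitly set to VL 0.
    padded = list(table) + [0] * (16 - len(table))
    sl2vl, errors = [], []
    for sl, vl in enumerate(padded):
        if vl != 15 and vl >= data_vls:
            # Unsupported VL: log an error and map the SL to VL15 (packets dropped).
            errors.append("SL %d maps to unsupported VL %d" % (sl, vl))
            vl = 15
        sl2vl.append(vl)
    return sl2vl, errors
```

Returning the error strings instead of logging them directly keeps the sketch independent of any particular logging facility.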
> >> 9.2. PR/MPR query handling:
> >> OpenSM should be able to enforce the provided policy on client request.
> >> The overall flow for such requests is: first the request is matched against the
> >> defined match rules such that the target QoS-Level definition is found. Given
> >> the QoS-Level a path(s) search is performed with the given restrictions imposed
> >> by that level. The following two sections describe these steps.
> >>
> >> How Service-ID is carried in the PathRecord and MultiPathRecord attributes is
> >> now standardized by the IBTA.
> >>
> >>
> >> 9.2.1. Matching rule search:
> >> A rule is "matching" a PR/MPR request using the following criteria:
> >> * Matching rules provide values in a list of either single value, or range of
> >>   values. A PR/MPR field is "matching" the rule field if it is explicitly
> >>   noted in the list of values or is one of the values covered by a range
> >>   included in the field values list.
> >> * Only PR/MPR fields that have their component mask bit set should be
> >>   compared.
> >> * For a rule to be "matching" a PR/MPR request all the rule fields should be
> >>   "matching" their PR/MPR fields. Consequently, a PR/MPR request that does
> >>   not have the component mask bit set for one of the rule-defined fields
> >>   cannot match that rule.
> >> * A PR/MPR request that has a component mask bit set for one of the fields
> >>   that is not defined by the rule can still match the rule.
> >>
> >> The algorithm to be used for searching for a rule match might be as simple as a
> >> sequential search through all rules or enhanced for better performance. The
> >> semantics of every rule field and its matching PR/MPR field are described
> >> below:
> >> * Source: the SGID or SLID should be part of this group
> >> * Destination: the DGID or DLID should be part of this group
> >> * Service-ID: check if the requested Service-ID (available in the PR/MPR old
> >>   SM-Key field) matches any of this rule's Service-IDs
> >> * TClass: check if the PR/MPR TClass field matches
> >>
> >> 9.2.2. PR/MPR response generation:
> >> The QoS-Level pointed to by the first rule that matches the PR/MPR request
> >> should be used for obtaining the response SL, MTU-Limit, RATE-Limit, Path-Bits
> >> and QoS-Class. A default QoS-Level should be used if no rule matches the query.
> >>
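The matching criteria and the first-match-with-default behaviour described above can be sketched as follows. This is a simplified illustration, not OpenSM code: all function names and the dict layout are hypothetical, only numeric fields (Service-ID, QoS-Class) are handled, and a request is modelled as a dict holding just the fields whose component mask bit is set.

```python
def parse_ranges(spec):
    """Parse a list such as "7-9,11" into [(7, 9), (11, 11)]."""
    ranges = []
    for item in spec.split(","):
        low, sep, high = item.strip().partition("-")
        low = int(low, 0)                      # base 0 also accepts "0x..." values
        ranges.append((low, int(high, 0) if sep else low))
    return ranges


def rule_matches(rule_fields, request):
    """True if every field defined by the rule matches the request."""
    for field, ranges in rule_fields.items():
        if field not in request:
            return False                       # mask bit not set: cannot match
        value = request[field]
        if not any(low <= value <= high for low, high in ranges):
            return False
    return True                                # extra request fields are ignored


def find_qos_level(rules, request, default_level):
    """Sequential first-match search; fall back to the default QoS-Level."""
    for rule_fields, level in rules:
        if rule_matches(rule_fields, request):
            return level
    return default_level
```

A plain sequential scan is used here because rule order is significant; a faster lookup structure would have to preserve that first-match semantics.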
> >> The efficient algorithm for finding paths that meet the QoS-Level criteria is
> >> beyond the scope of this RFC and left for the implementer to provide. However
> >> the criteria by which the paths match the QoS-Level are described below:
> >>
> >> * SL: The paths found should all use the given SL. To that end, the PR/MPR
> >>   algorithm should traverse the path from source to destination only through
> >>   ports that carry a valid VL (not VL15) by the SL2VL map (it should consider
> >>   input and output ports and SL).
> >> * MTU-Limit: The resulting paths MTU should not exceed the given MTU-Limit
> >> * Rate-Limit: The resulting paths RATE should not exceed the given RATE-Limit
> >>   (rate limit is given in units of link BW = Width*Speed according to IBTA
> >>   Specification Vol-1 table-205 p-901 l-24).
> >> * Path-Bits: define the target LID lowest bits (number of bits defined by the
> >>   target port PortInfo.LMC field). The path should traverse the LFT using the
> >>   target port LID with the path-bits set.
> >> * QoS-Class: should be returned in the result PR/MPR. When routing is going to
> >>   be supported by OpenSM we might use this field in selecting the target
> >>   router too in a TBD way.
> >>
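The Path-Bits criterion above amounts to setting the lowest LMC bits of the target port's base LID. A minimal sketch, assuming the base LID has its low LMC bits clear; the function name lid_with_path_bits is hypothetical:

```python
def lid_with_path_bits(base_lid, lmc, path_bits):
    """Return the destination LID selected by path_bits for a port with this LMC."""
    if not 0 <= lmc <= 7:
        raise ValueError("LMC is a 3-bit value")
    mask = (1 << lmc) - 1
    if base_lid & mask:
        raise ValueError("base LID must have its low LMC bits clear")
    if not 0 <= path_bits <= mask:
        raise ValueError("path bits do not fit in LMC bits")
    return base_lid | path_bits
```

With LMC 0 the only valid path-bits value is 0, which is consistent with the RFC noting that Path Bits are relevant only when LMC > 0.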