[ofa-general] QoS RFC

Sasha Khapyorsky sashak at voltaire.com
Sun Jul 22 17:20:11 PDT 2007


Hi Yevgeny,

Some initial comments.

On 01:07 Sun 22 Jul     , Yevgeny Kliteynik wrote:
>  Hi All
> 
>  Please find the attached RFC describing how QoS policy support could be 
>  implemented in the OpenFabrics stack.
>  Your comments are welcome.
> 
>  -- Yevgeny
> 
>                RFC: OpenFabrics Enhancements for QoS Support
>               ===============================================
> 
>  Authors: . Eitan Zahavi <eitan at mellanox.co.il>
>  Authors: . Yevgeny Kliteynik <kliteyn at mellanox.co.il>
>  Date: .... Jul 2007.
>  Revision:  0.2
> 
>  Table of contents:
>  1. Overview
>  2. Architecture
>  3. Supported Policy
>  4. IPoIB functionality
>  5. CMA functionality
>  6. SDP functionality
>  7. SRP functionality
>  8. iSER functionality
>  9. OpenSM functionality
> 
>  1. Overview
>  ------------
>  Quality of Service requirements stem from the realization of I/O
>  consolidation over an IB network: as multiple applications and ULPs share
>  the same fabric, means to control their use of the network resources are
>  becoming a must. The basic need is to differentiate the service levels
>  provided to different traffic flows, so that a policy can be enforced to
>  control each flow's utilization of the fabric resources.
> 
>  The IBTA specification defines several hardware features and management
>  interfaces to support QoS:
>  * Up to 15 Virtual Lanes (VL) carry traffic in a non-blocking manner
>  * Arbitration between traffic of different VLs is performed by a weighted
>    round robin arbiter with two priority levels. The arbiter is programmable
>    with a sequence of (VL, weight) pairs and a maximal number of high
>    priority credits to be processed before low priority is served
>  * Packets carry a class of service marking in the range 0 to 15 in their
>    header SL field
>  * Each switch can map an incoming packet by its SL to a particular output
>    VL based on a programmable table VL=SL-to-VL-MAP(in-port, out-port, SL)
>  * The Subnet Administrator controls the parameters of each communication
>    flow by providing them in response to Path Record (PR) or
>    MultiPathRecord (MPR) queries
> 
>  The IB QoS features provide the means to implement a DiffServ-like
>  architecture. The DiffServ architecture (IETF RFC 2474 and RFC 2475) is
>  widely used today in highly dynamic fabrics.
> 
>  This proposal provides the detailed functional definition for the various
>  software elements that are required to enable a DiffServ-like architecture
>  over the OpenFabrics software stack.
> 
> 
> 
>  2. Architecture
>  ----------------
>  This proposal splits the QoS functionality between the SM/SA, the CMA and
>  the various ULPs. We take the "chronology approach" to describe how the
>  overall system works:
> 
>  2.1. The network manager (human) provides a set of rules (policy) that
>  defines how the network is configured and how its resources are split
>  among different QoS-Levels. The policy also defines how to decide which
>  QoS-Level each application, ULP or service uses.
> 
>  2.2. The SM analyzes the provided policy to see if it is realizable and
>  performs the necessary fabric setup. The SM may continuously monitor the
>  policy and adapt to changes in it. Part of this policy defines the default
>  QoS-Level of each partition. The SA is enhanced to match the requested
>  Source, Destination, QoS-Class, Service-ID (and optionally SL and priority)
>  against the policy, so clients (ULPs, programs) can obtain a policy-enforced
>  QoS. The SM is also enhanced to support setting up partitions with an
>  appropriate IPoIB broadcast group. This broadcast group carries its QoS
>  attributes: SL, MTU and RATE.
> 
>  2.3. IPoIB is set up. IPoIB uses the SL, MTU and RATE available on the
>  multicast group which forms the broadcast group of this partition.
> 
>  2.4. MPI, which provides non-IB-based connection management, should be
>  configured to run using hard-coded SLs. It uses these SLs for every QP
>  being opened.
> 
>  2.5. ULPs that use the CM interface (like SRP) should have their own
>  pre-assigned Service-ID and use it while obtaining a PR/MPR for establishing
>  connections. The SA receiving the PR/MPR request should match it against the
>  policy and return the appropriate PR/MPR including SL, MTU and RATE.
> 
>  2.6. ULPs and programs using the CMA to establish an RC connection should
>  provide the CMA with the target IP and Service-ID. Some of the ULPs might
>  also provide a QoS-Class (e.g. for SDP sockets that are provided the TOS
>  socket option). The CMA should then use the provided Service-ID and optional
>  QoS-Class and pass them in the PR/MPR request. The resulting PR/MPR should
>  be used for configuring the connection QP.
> 
>  PathRecord and MultiPathRecord enhancement for QoS:
>  As mentioned above, the PathRecord and MultiPathRecord attributes should be
>  enhanced to carry the Service-ID, which is a 64-bit value that has been
>  standardized by the IBTA. A new field, QoS-Class, is also provided.
>  A new capability bit should describe the SM QoS support in the SA class port
>  info. This approach provides an easy migration path for existing access
>  layers and ULPs by not introducing a new set of PR/MPR attributes.
> 
> 
>  3. Supported Policy
>  --------------------
> 
>  The QoS policy supported by this proposal is divided into 4 subsections:
> 
>  I) Port Group: a set of CAs, Routers or Switches that share the same
>  settings. A port group might be a partition defined by the partition manager
>  policy in terms of GUIDs. Future implementations might provide support for
>  NodeDescription-based definition of port groups.

Isn't it better to have port group definitions in a separate file? Then
groups could be shared with other OpenSM components (as discussed). Even
if such group sharing is not high-priority functionality, this should
save us from redoing things later.

>  II) Fabric Setup:
>  Defines how the SL2VL and VLArb tables should be set up. This policy
>  definition assumes the computation of overall end-to-end network behavior
>  should be performed outside of OpenSM.
> 
>  III) QoS-Levels Definition:
>  This section defines the possible sets of parameters for QoS that a client
>  might be mapped to. Each set holds: SL and optionally: Max MTU, Max Rate,
>  Packet Lifetime and Path Bits (in case LMC > 0 is used for QoS).
> 
>  IV) Matching Rules:
>  A list of rules that match an incoming PR/MPR request to a QoS-Level. The
>  rules are processed in order such that the first match is applied. Each rule
>  is built out of a set of match expressions which should all match for the
>  rule to apply. The matching expressions are defined for the following fields:
>  ** SRC and DST to lists of port groups
>  ** Service-ID to a list of Service-IDs or Service-ID ranges
>  ** QoS-Class to a list of QoS-Class values or ranges
> 
>  QoS Policy file syntax
> 
>  * Empty lines are ignored
>  * Leading and trailing blanks are ignored, so the indentation in the
>    example is just for better readability
>  * Comments start with the pound sign (#) and are terminated by EOL
>  * Comments may appear only on a separate line

Why? What is wrong with:

	port-name: vs1/HCA-1/P1   # my best port

>  * Keywords that denote section/subsection start have matching closing
>    keywords
>  * Any keyword should be the first non-blank in the line
> 
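
For illustration, a minimal Python sketch of the syntax rules listed above.
The function names and the (kind, keyword, value) tuple shape are
hypothetical; they are not part of the RFC or of OpenSM. It assumes comments
appear only on their own lines, as the RFC currently states, so a trailing
"# ..." would simply stay in the value.

```python
def parse_policy_lines(text):
    """Classify policy file lines as section open, section close, or field.

    Returns a list of (kind, keyword, value) tuples. Hypothetical helper.
    """
    events = []
    for raw in text.splitlines():
        line = raw.strip()                  # leading/trailing blanks are ignored
        if not line or line.startswith("#"):
            continue                        # empty lines and comment lines
        if ":" in line:
            # "keyword: value"; split only once so values may contain ':'
            key, value = line.split(":", 1)
            events.append(("field", key.strip(), value.strip()))
        elif line.startswith("end-"):
            events.append(("close", line[len("end-"):], None))
        else:
            events.append(("open", line, None))
    return events


def check_nesting(events):
    """Verify that every section keyword has its matching closing keyword."""
    stack = []
    for kind, key, _ in events:
        if kind == "open":
            stack.append(key)
        elif kind == "close":
            if not stack or stack.pop() != key:
                raise ValueError("unexpected end-%s" % key)
    if stack:
        raise ValueError("unclosed section: %s" % stack[-1])
```

Keeping the line classifier separate from the nesting check means a later
change to the comment rule would touch only the first function.
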
>  QoS Policy file example
> 
>      # Port Groups define sets of ports to be used later in the settings
>      port-groups
>          # using port GUIDs
>          port-group
>              name: Storage
>              # "use" is just a description that is used for logging.
>              #  Other than that, it is just a commentary
>              use: our SRP storage targets
>              port-guid: 0x1000000000000001
>              port-guid: 0x1000000000000002
>          end-port-group
> 
>          port-group
>              name: Virtual Servers
>              use: node desc and IB port num
>              # The syntax of the port name is as follows:
>              # "hostname/CA-num/Pnum".
>              # "hostname" and "CA-num" are compared to the first 2 words of
>              # NodeDescription, and "Pnum" is a port number on that node.
>              port-name: vs1/HCA-1/P1
>              port-name: vs3/HCA-1/P1
>              port-name: vs3/HCA-2/P2

What about wildcarding here, like vs1/*/* or just vs1?

>          end-port-group
> 
>          # using partitions defined in the partition policy
>          port-group
>              name: Group for Partition 1
>              use: default settings
>              partition: Part1
>          end-port-group
> 
>          # using node types CA|ROUTER|SWITCH

Probably also ALL (for all ports), SELF (for SM port)?

>          port-group
>              name: Routers
>              use: all routers
>              node-type: ROUTER
>          end-port-group
> 
>      end-port-groups

I agree that the proposed syntax is better for human readability than pure
XML, but wouldn't something like this be more user-friendly?

Storage "Free Text description" = 0x10001, 0x10002, 0x10003 ;

, or

Storage "Free Text description" { 0x10001, 0x10002, 0x10003 };

, or

Storage "Free Text description": ROUTERS, CAS ;


> 
>      qos-setup
> 
>          # define all types of VLArb tables. The length of the tables should
>          # match the physically supported tables by their target ports
>          vlarb-tables
>              # scope defines the exact ports the VLArb tables apply to
>              vlarb-scope
>                  # defining VLArb tables on all the ports that belong to
>                  # port group 'Storage', and on all the ports connected
>                  # to ports of port group 'Storage'
>                  group: Storage

So "group" is only for ports that belong to 'Storage'?

>                  # "across" means all the ports that are connected to ports
>                  # that belong to the specified port group
>                  across: Storage
>                  # VLArb table holds VL and weight pairs
>                  vlarb-high: 0:255,1:127,2:63,3:31,4:15,5:7,6:3,7:1
>                  vlarb-low: 8:255,9:127,10:63,11:31,12:15,13:7,14:3
>                  vl-high-limit: 10
>              end-vlarb-scope
>              # There can be several scopes
>          end-vlarb-tables
> 
>          sl2vl-tables
>              # Scope defines the exact devices and in/out ports tables apply
>              # to.
>              # Note: if the same port is matching several rules the *FIRST*
>              # one applies.
>              sl2vl-scope
>                  # SL2VL tables are organized as SL2VL(in-port,out-port)
>                  # "from: n,m" means we define the SL2VL(n,*) and SL2VL(m,*)
>                  # "to: n,m" means we define the SL2VL(*,n) and SL2VL(*,m)
>                  #
>                  # The following example specifies that all the SL2VL tables
>                  # entries should be defined for all the ports of group
>                  # Part1:
>                  group: Part1
>                  from: *
>                  to: *
>                  # SL2VL table has to have 16 values at max - one for each
>                  # SL.
>                  # If the user specifies less than 16 values, all the missing
>                  # VL values will be implicitly set to 0
>                  sl2vl-table: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
>              end-sl2vl-scope
> 
>              sl2vl-scope
>                  # "across-to" is a combination of the "across" keyword
>                  # (definition can be found in the VLArb tables section)
>                  # and the "to" keyword.
>                  # "across: PortGroupName" refers to all the ports that are
>                  # connected to ports that belong to PortGroupName.
>                  #
>                  # Example of "across-to" usage:
>                  #   A user has a set of 'special' nodes (e.g. storage
>                  #   nodes), and all the traffic to these nodes has to get
>                  #   a specific VL. The solution is to define a port group
>                  #   (e.g. "Storage") that will include all the ports of
>                  #   these nodes, and then to configure SL2VL tables on all
>                  #   the switch ports that are connected to the Storage
>                  #   port group by specifying "across-to: Storage".
>                  #
>                  across-to: Storage2
>                  # Similar to "across-to", "across-from" is a combination of
>                  # the "across" and "from" keywords
>                  across-from: Storage1
>                  sl2vl-table: 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0
>              end-sl2vl-scope
>          end-sl2vl-tables
> 
>      end-qos-setup
> 
> 
>      qos-levels
> 
>          # the first one is just setting SL
>          qos-level
>              use: for the lowest priority communication
>              sl: 15
>              packet-life: 16
>          end-qos-level
>          # the second sets SL and QoS Class
>          qos-level
>              use: low latency best bandwidth
>              sl: 0
>          end-qos-level
>          # the whole set: SL, MTU-Limit, Rate-Limit, Packet Lifetime,
>          # Path Bits
>          qos-level
>              use: just an example
>              sl: 0
>              mtu-limit: 1
>              rate-limit: 1
>              packet-life: 12
>              # Path Bits can be used e.g. to provide different routes
>              # through the subnet to a particular port
>              path-bits: 2,4,8-32
>          end-qos-level
> 
>      end-qos-levels
> 
> 
>      # Match rules are scanned in a first-fit manner (like a firewall rules
>      # table)
>      qos-match-rules
> 
>          # matching by single criteria: class (list of values and ranges)
>          qos-match-rule
>              # just a description
>              use: low latency by class 7-9 or 11
>              qos-class: 7-9,11
>              # number of qos-level to apply to the matching PR/MPR
>              qos-level-sn: 1

Isn't it better and less error prone to match qos_level by name and not
by sequential number?

>          end-qos-match-rule
>          # show matching by destination group AND service-ids
>          qos-match-rule
>              use: Storage targets connection
>              destination: Storage
>              service-id: 22,4719-5000
>              qos-level-sn: 2
>          end-qos-match-rule
>          # show matching by source group only
>          qos-match-rule
>              use: bla bla
>              source: Storage
>              qos-level-sn: 3
>          end-qos-match-rule
> 
>      end-qos-match-rules
> 
> 
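
The value lists used in the example above ("7-9,11", "22,4719-5000") could be
parsed along these lines. This is only a sketch: parse_ranges() and
in_ranges() are hypothetical names, and accepting a 0x prefix via int(x, 0)
is an assumption, since the RFC does not say which number formats are valid.

```python
def parse_ranges(spec):
    """Parse a list such as "7-9,11" into inclusive (low, high) pairs."""
    ranges = []
    for item in spec.split(","):
        low_text, sep, high_text = item.strip().partition("-")
        low = int(low_text, 0)                  # base 0 also accepts 0x...
        high = int(high_text, 0) if sep else low
        if high < low:
            raise ValueError("bad range: %s" % item.strip())
        ranges.append((low, high))
    return ranges


def in_ranges(value, ranges):
    """True if value is a listed single value or falls inside a range."""
    return any(low <= value <= high for low, high in ranges)
```

For example, in_ranges(8, parse_ranges("7-9,11")) is true while the same
check for 10 is false.
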
>  4. IPoIB
>  ---------
> 
>  IPoIB already queries the SA for its broadcast group information. The
>  additional functionality required is for IPoIB to provide the broadcast
>  group SL, MTU, and RATE in every following PathRecord query performed when
>  a new UDAV is needed by IPoIB.
>  We could assign a special Service-ID for IPoIB use, but since all
>  communication on the same IPoIB interface shares the same QoS-Level without
>  the ability to differentiate it by target service, we can ignore it for
>  simplicity.
> 
>  5. CMA features
>  ----------------
> 
>  The CMA interface supports Service-ID through the notion of port space as a
>  prefix to the port_num which is part of the sockaddr provided to
>  rdma_resolve_addr(). What is missing is the explicit request for a
>  QoS-Class that should allow the ULP (like SDP) to propagate a specific
>  request for a class of service. A mechanism for providing the QoS-Class is
>  available in the IPv6 address, so we could use that address field. Another
>  option is to implement a special connection options API for CMA.
> 
>  The functionality missing from the CMA is the usage of the provided
>  QoS-Class and Service-ID in the sent PR/MPR. When a response is obtained,
>  it is an existing requirement for the CMA to use the PR/MPR from the
>  response in setting up the QP address vector.
> 
> 
>  6. SDP
>  -------
> 
>  SDP uses CMA for building its connections.
>  The Service-ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits
>  holding the remote TCP/IP Port Number to connect to.
>  SDP might be provided with SO_PRIORITY socket option. In that case the value
>  provided should be sent to the CMA as the TClass option of that connection.
> 
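
As a small worked example of the 0x000000000001PPPP layout described above
(the function and constant names are hypothetical, chosen only for this
sketch):

```python
SDP_SERVICE_ID_PREFIX = 0x0000000000010000   # the fixed "...0001" part


def sdp_service_id(tcp_port):
    """Build the 64-bit SDP Service-ID for a remote TCP/IP port number."""
    if not 0 <= tcp_port <= 0xFFFF:
        raise ValueError("port must fit in 4 hex digits")
    return SDP_SERVICE_ID_PREFIX | tcp_port
```

With this layout, port 22 gives 0x0000000000010016, so a policy rule can
match a range of SDP ports with a plain Service-ID range.
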
>  7. SRP
>  -------
> 
>  The current SRP implementation uses its own CM callbacks (not CMA). So SRP
>  should fill in the Service-ID in the PR/MPR by itself and use that
>  information in setting up the QP. The T10 SRP standard defines the SRP
>  Service-ID to be defined by the SRP target I/O Controller (but it should
>  also comply with IBTA Service-ID rules). Anyway, the Service-ID is reported
>  by the I/O Controller in the ServiceEntries DM attribute and should be used
>  in the PR/MPR if the SA reports its ability to handle QoS PR/MPRs.
> 
>  8. iSER
>  --------
>  iSER uses CMA and thus should be very close to SDP. The Service-ID for iSER
>  is TBD.
> 
> 
>  9. OpenSM features
>  -------------------
>  The QoS related functionality to be provided by OpenSM can be split into two
>  main parts:
> 
>  9.1. Fabric Setup
>  During fabric initialization the SM should parse the policy and apply its
>  settings to the discovered fabric elements. The following actions should be
>  performed:
>  * Parsing of policy
>  * Node Group identification. A warning should be provided for each node not
>    specified but found.
>  * SL2VL settings should be validated:
>    + A warning will be provided if there are no matching targets for the
>      SL2VL setting statement.
>    + An error message will be printed to the log file if an invalid setting
>      is found. A setting is invalid if it refers to:
>      - Non-existing port numbers of the target devices
>      - Unsupported VLs for the target device. In the latter case the map to
>        non-existing VLs should be replaced with VL15, i.e. packets will be
>        dropped.

I'm not sure this is optimal. We could have a well-documented or even
configurable mapping rule instead; then this would not limit devices with
higher capabilities.

>  * SL2VL setting is to be performed
>  * VL Arbitration table settings should be validated according to the
>    following rules:
>    + A warning will be provided if there are no matching targets for the
>      setting statement
>    + An error will be provided if the port number exceeds the target ports
>    + An error will be generated if the table length exceeds device
>      capabilities

Ditto.

>    + A warning will be generated if the table quote a VL that is not
>      supported by the target device

What is "table quote" here?

>  * VL Arbitration tables will be set on the appropriate targets
> 
>  9.2. PR/MPR query handling:
>  OpenSM should be able to enforce the provided policy on client requests.
>  The overall flow for such requests is: first the request is matched against
>  the defined match rules such that the target QoS-Level definition is found.
>  Given the QoS-Level, a path search is performed with the restrictions
>  imposed by that level. The following two sections describe these steps.
> 
>  How the Service-ID is carried in the PathRecord and MultiPathRecord
>  attributes is now standardized by the IBTA.
> 
> 
>  9.2.1. Matching rule search:
>  A rule is "matching" a PR/MPR request using the following criteria:
>  * Matching rules provide values in a list of either single values or ranges
>    of values. A PR/MPR field is "matching" the rule field if it is explicitly
>    noted in the list of values or is one of the values covered by a range
>    included in the field values list.
>  * Only PR/MPR fields that have their component mask bit set should be
>    compared.
>  * For a rule to be "matching" a PR/MPR request all the rule fields should be
>    "matching" their PR/MPR fields. Thus a PR/MPR request that does not have
>    the component mask bit set for one of the rule-defined fields cannot match
>    that rule.
>  * A PR/MPR request that has a component mask bit set for one of the fields
>    that is not defined by the rule can match the rule.

Aren't the last two too restrictive? The SA can just filter out paths in
the response to match the rest of the rule. No?

>  The algorithm to be used for searching for a rule match might be as simple
>  as a sequential search through all rules or enhanced for better
>  performance. The semantics of every rule field and its matching PR/MPR
>  field are described below:
>  * Source: the SGID or SLID should be part of this group
>  * Destination: the DGID or DLID should be part of this group
>  * Service-ID: check if the requested Service-ID (available in the PR/MPR old
>    SM-Key field) is matching any of this rule's Service-IDs
>  * TClass: check if the PR/MPR TClass field is matching
> 
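
For reference, the matching semantics as written in the RFC (sequential
first-match search, all rule fields must match, unset component mask fields
cannot match) can be sketched as below. The dictionary layout and function
names are hypothetical. Only integer fields such as Service-ID and QoS-Class
are modelled; port group membership for Source/Destination is left out.

```python
def rule_matches(rule, request):
    """rule["match"] maps field name -> list of inclusive (low, high) ranges.

    request holds only the PR/MPR fields whose component mask bit is set.
    """
    for field, ranges in rule["match"].items():
        if field not in request:
            return False        # mask bit not set for a rule-defined field
        value = request[field]
        if not any(low <= value <= high for low, high in ranges):
            return False
    return True                 # extra request fields do not prevent a match


def select_qos_level(rules, request, default_level):
    """Return the QoS-Level of the first matching rule, else the default."""
    for rule in rules:
        if rule_matches(rule, request):
            return rule["qos_level"]
    return default_level
```

Under these semantics a request that sets both QoS-Class and Service-ID is
decided by rule order alone, which is why the order of rules in the policy
file matters.
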
>  9.2.2. PR/MPR response generation:
>  The QoS-Level pointed to by the first rule that matches the PR/MPR request
>  should be used for obtaining the response SL, MTU-Limit, RATE-Limit,
>  Path-Bits and QoS-Class. A default QoS-Level should be used if no rule is
>  matching the query.

Where should this default be defined?

Sasha


>  An efficient algorithm for finding paths that meet the QoS-Level criteria
>  is beyond the scope of this RFC and is left for the implementer to provide.
>  However, the criteria by which the paths match the QoS-Level are described
>  below:
> 
>  * SL: The paths found should all use the given SL. For that sake the PR/MPR
>    algorithm should traverse the path from source to destination only through
>    ports that carry a valid VL (not VL15) by the SL2VL map (should consider
>    input and output ports and SL).
>  * MTU-Limit: The resulting paths' MTU should not exceed the given MTU-Limit
>  * Rate-Limit: The resulting paths' RATE should not exceed the given
>    RATE-Limit (rate limit is given in units of link BW = Width*Speed
>    according to IBTA Specification Vol-1 table-205 p-901 l-24).
>  * Path-Bits: define the target LID lowest bits (number of bits defined by
>    the target port PortInfo.LMC field). The path should traverse the LFT
>    using the target port LID with the path-bits set.
>  * QoS-Class: should be returned in the result PR/MPR. When routing is
>    supported by OpenSM we might use this field in selecting the target
>    router too in a TBD way.
> 
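
A rough sketch of the Path-Bits and limit checks described above. All names
are hypothetical. To keep the comparison obvious it assumes MTU in bytes and
rate in Gb/s rather than the encoded PathRecord values, and it assumes the
base LID has its low LMC bits clear.

```python
def dlid_with_path_bits(base_lid, lmc, path_bits):
    """Return the destination LID with the given path bits set."""
    mask = (1 << lmc) - 1               # LMC gives the number of low bits
    if not 0 <= path_bits <= mask:
        raise ValueError("path bits do not fit in LMC")
    if base_lid & mask:
        raise ValueError("base LID is not aligned to LMC")
    return base_lid | path_bits


def path_meets_level(path, level):
    """Check a candidate path against a QoS-Level; limits are optional."""
    if path["sl"] != level["sl"]:
        return False
    if "mtu_limit" in level and path["mtu_bytes"] > level["mtu_limit"]:
        return False
    if "rate_limit" in level and path["rate_gbps"] > level["rate_limit"]:
        return False
    return True
```

The SL2VL traversal check (no hop may map the SL to VL15) is not covered
here, since it needs the fabric topology.
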
