[ofa-general] QoS RFC

Yevgeny Kliteynik kliteyn at dev.mellanox.co.il
Thu Jul 26 05:39:36 PDT 2007


Hi Sasha,

Sasha Khapyorsky wrote:
> Hi Yevgeny,
> 
> Some initial comments.
> 
> On 01:07 Sun 22 Jul     , Yevgeny Kliteynik wrote:
>>  Hi All
>>
>>  Please find the attached RFC describing how QoS policy support could be 
>>  implemented in the OpenFabrics stack.
>>  Your comments are welcome.
>>
>>  -- Yevgeny
>>
>>                RFC: OpenFabrics Enhancements for QoS Support
>>               ===============================================
>>
>>  Authors: . Eitan Zahavi <eitan at mellanox.co.il>
>>  Authors: . Yevgeny Kliteynik <kliteyn at mellanox.co.il>
>>  Date: .... Jul 2007.
>>  Revision:  0.2
>>
>>  Table of contents:
>>  1. Overview
>>  2. Architecture
>>  3. Supported Policy
>>  4. IPoIB functionality
>>  5. CMA functionality
>>  6. SDP functionality
>>  7. SRP functionality
>>  8. iSER functionality
>>  9. OpenSM functionality
>>
>>  1. Overview
>>  ------------
>>  Quality of Service requirements stem from the realization of I/O
>>  consolidation over IB network: As multiple applications and ULPs share the
>>  same fabric, means to control their use of the network resources are
>>  becoming a must. The basic need is to differentiate the service levels
>>  provided to different traffic flows, so that a policy can be enforced to
>>  control each flow's utilization of the fabric resources.
>>
>>  The IBTA specification defines several hardware features and management
>>  interfaces to support QoS:
>>  * Up to 15 Virtual Lanes (VL) carry traffic in a non-blocking manner
>>  * Arbitration between traffic of different VLs is performed by a 2 priority
>>    levels weighted round robin arbiter. The arbiter is programmable with
>>    a sequence of (VL, weight) pairs and maximal number of high priority
>>    credits to be processed before low priority is served
>>  * Packets carry class of service marking in the range 0 to 15 in their
>>    header SL field
>>  * Each switch can map the incoming packet by its SL to a particular output
>>    VL based on programmable table VL=SL-to-VL-MAP(in-port, out-port, SL)
>>  * The Subnet Administrator controls each communication flow parameters
>>    by providing them as a response to Path Record (PR) or MultiPathRecord
>>    (MPR) queries
>>
>>  The IB QoS features provide the means to implement a DiffServ like
>>  architecture. DiffServ architecture (IETF RFC 2474 and RFC 2475) is widely
>>  used today in highly dynamic fabrics.
>>
>>  This proposal provides the detailed functional definition for the various
>>  software elements that are required to enable a DiffServ like architecture
>>  over the OpenFabrics software stack.
>>
>>
>>
>>  2. Architecture
>>  ----------------
>>  This proposal splits the QoS functionality between the SM/SA, CMA and the
>>  various ULPs. We take the "chronology approach" to describe how the overall
>>  system works:
>>
>>  2.1. The network manager (human) provides a set of rules (policy) that
>>  defines how the network is being configured and how its resources are split
>>  to different QoS-Levels. The policy also defines how to decide which
>>  QoS-Level each application or ULP or service uses.
>>
>>  2.2. The SM analyzes the provided policy to see if it is realizable and
>>  performs the necessary fabric setup. The SM may continuously monitor the
>>  policy and adapt to changes in it. Part of this policy defines the default
>>  QoS-Level of each partition. The SA is being enhanced to match the requested
>>  Source, Destination, QoS-Class, Service-ID (and optionally SL and priority)
>>  against the policy, so clients (ULPs, programs) can obtain a policy enforced
>>  QoS. The SM is also enhanced to support setting up partitions with an
>>  appropriate IPoIB broadcast group. This broadcast group carries its QoS
>>  attributes: SL, MTU and RATE.
>>
>>  2.3. IPoIB is being set up. IPoIB uses the SL, MTU and RATE available on the
>>  multicast group which forms the broadcast group of this partition.
>>
>>  2.4. MPI, which provides non IB based connection management, should be
>>  configured to run using hard coded SLs. It uses these SLs for every QP being
>>  opened.
>>
>>  2.5. ULPs that use CM interface (like SRP) should have their own
>>  pre-assigned Service-ID and use it while obtaining PR/MPR for establishing
>>  connections. The SA receiving the PR/MPR should match it against the policy
>>  and return the appropriate PR/MPR including SL, MTU and RATE.
>>
>>  2.6. ULPs and programs using CMA to establish RC connection should provide
>>  the CMA the target IP and Service-ID. Some of the ULPs might also provide
>>  QoS-Class (e.g. for SDP sockets that are provided the TOS socket option).
>>  The CMA should then use the provided Service-ID and optional QoS-Class and
>>  pass them in the PR/MPR request. The resulting PR/MPR should be used for
>>  configuring the connection QP.
>>
>>  PathRecord and MultiPathRecord enhancement for QoS:
>>  As mentioned above the PathRecord and MultiPathRecord attributes should be
>>  enhanced to carry the Service-ID which is a 64bit value, which has been
>>  standardized by the IBTA. A new field QoS-Class is also provided.
>>  A new capability bit should describe the SM QoS support in the SA class port
>>  info. This approach provides an easy migration path for existing access
>>  layer and ULPs by not introducing a new set of PR/MPR attributes.
>>
>>
>>  3. Supported Policy
>>  --------------------
>>
>>  The QoS policy supported by this proposal is divided into 4 sub sections:
>>
>>  I) Port Group: a set of CAs, Routers or Switches that share the same
>>  settings. A port group might be a partition defined by the partition manager
>>  policy in terms of GUIDs. Future implementations might provide support for
>>  NodeDescription based definition of port groups.
> 
> Isn't it better to have port group definitions in a separate file? So
> groups could be shared with other OpenSM components (as discussed). Even
> if such group sharing is not high priority functionality this should
> save us from redoing things later.
> 
>>  II) Fabric Setup:
>>  Defines how the SL2VL and VLArb tables should be set up. This policy
>>  definition assumes the computation of overall end to end network behavior
>>  should be performed outside of OpenSM.
>>
>>  III) QoS-Levels Definition:
>>  This section defines the possible sets of parameters for QoS that a client
>>  might be mapped to. Each set holds: SL and optionally: Max MTU, Max Rate,
>>  Packet Lifetime and Path Bits (in case LMC > 0 is used for QoS).
>>
>>  IV) Matching Rules:
>>  A list of rules that match an incoming PR/MPR request to a QoS-Level. The
>>  rules are processed in order such that the first match is applied. Each rule
>>  is built out of a set of match expressions which should all match for the
>>  rule to apply. The matching expressions are defined for the following
>>  fields:
>>  ** SRC and DST to lists of port groups
>>  ** Service-ID to a list of Service-ID or Service-ID ranges
>>  ** QoS-Class to a list of QoS-Class values or ranges
>>
>>  QoS Policy file syntax
>>
>>  * Empty lines are ignored
>>  * Leading and trailing blanks are ignored, so the
>>    indentation in the example is just for better readability
>>  * Comments are started with the pound sign (#) and terminated by EOL
>>  * Comments may appear only in a separate line
> 
> Why? What is wrong with:
> 
> 	port-name: vs1/HCA-1/P1   # my best port

I can use this too, but then the pound sign, wherever it appears,
would start a comment. There would be no \# or similar escape to
include it anywhere else - I don't want to complicate the syntax.
Sounds OK?
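
To make the rule concrete, here is a minimal sketch of it (Python only for
brevity, OpenSM itself is C; the function names are made up for illustration
and are not part of any existing parser). The first '#' on a line always
starts a comment, there is no escape, and blank lines fall out as a side
effect:

```python
def strip_comment(line: str) -> str:
    """Drop everything from the first '#' to end of line, then trim blanks."""
    return line.split("#", 1)[0].strip()


def significant_lines(text: str) -> list[str]:
    """Return the non-empty, comment-free lines of a policy file."""
    stripped = (strip_comment(raw) for raw in text.splitlines())
    return [line for line in stripped if line]
```

The price of the simple rule is that no value (for instance a "use"
description) can contain a pound sign.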


>>  * Keywords that denote section/subsection start have matching closing
>>    keywords
>>  * Any keyword should be the first non-blank in the line
>>
>>  QoS Policy file example
>>
>>      # Port Groups define sets of ports to be used later in the settings
>>      port-groups
>>          # using port GUIDs
>>          port-group
>>              name: Storage
>>              # "use" is just a description that is used for logging.
>>              #  Other than that, it is just a commentary
>>              use: our SRP storage targets
>>              port-guid: 0x1000000000000001
>>              port-guid: 0x1000000000000002
>>          end-port-group
>>
>>          port-group
>>              name: Virtual Servers
>>              use: node desc and IB port num
>>              # The syntax of the port name is as follows:
>>              # "hostname/CA-num/Pnum".
>>              # "hostname" and "CA-num" are compared to the first 2 words of
>>              # NodeDescription, and "Pnum" is a port number on that node.
>>              port-name: vs1/HCA-1/P1
>>              port-name: vs3/HCA-1/P1
>>              port-name: vs3/HCA-2/P2
> 
> What about wildcarding here, like vs1/*/* or just vs1?

Good idea.
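
One possible reading of it, as a rough sketch (Python for brevity; the
function name and the padding rule are assumptions, nothing here is decided):
each of the three components is matched with shell-style wildcards, and
components missing from the pattern are treated as '*', so a bare "vs1"
behaves like "vs1/*/*".

```python
from fnmatch import fnmatchcase


def port_name_matches(pattern: str, port_name: str) -> bool:
    """Match 'hostname/CA-num/Pnum' against a pattern with optional '*' parts.

    Missing trailing components in the pattern are treated as '*'
    (an assumption of this sketch).
    """
    want = pattern.split("/")
    have = port_name.split("/")
    if len(have) != 3 or not 1 <= len(want) <= 3:
        return False
    want += ["*"] * (3 - len(want))
    # fnmatchcase(name, pattern) is case-sensitive shell-style matching
    return all(fnmatchcase(h, w) for h, w in zip(have, want))
```

Note that fnmatchcase also understands '?' and '[...]', which may or may not
be wanted in the policy syntax.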

>>          end-port-group
>>
>>          # using partitions defined in the partition policy
>>          port-group
>>              name: Group for Partition 1
>>              use: default settings
>>              partition: Part1
>>          end-port-group
>>
>>          # using node types CA|ROUTER|SWITCH
> 
> Probably also ALL (for all ports), SELF (for SM port)?

Agree.

>>          port-group
>>              name: Routers
>>              use: all routers
>>              node-type: ROUTER
>>          end-port-group
>>
>>      end-port-groups
> 
> I agree that the proposed syntax is better for human readability than pure
> XML, but wouldn't something like this be more user-friendly?
> 
> Storage "Free Text description" = 0x10001, 0x10002, 0x10003 ;
> 
> , or
> 
> Storage "Free Text description" { 0x10001, 0x10002, 0x10003 };
> 
> , or
> 
> Storage "Free Text description": ROUTERS, CAS ;

GUID list is a good idea.
Not sure about the other stuff. A certain port group can be defined
both by guids and by node-types. How about this:

           port-group
               name: routers_and_mgt_nodes
               use: all routers and management nodes
               node-type: ROUTER
               port-guid: 0x10001, 0x10002, 0x10003
           end-port-group
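
A small sketch of how such a block could be read (Python for brevity; the
function name and the dictionary layout are made up for illustration). It
assumes comments are already stripped, that "port-guid" and "node-type" may
repeat and may hold comma-separated lists, and that "name" and "use" are
single values:

```python
def parse_port_group(lines: list[str]) -> dict:
    """Parse one 'port-group' ... 'end-port-group' block into a dict."""
    group = {"name": None, "use": None, "node-type": [], "port-guid": []}
    body = [ln.strip() for ln in lines if ln.strip()]
    if not body or body[0] != "port-group" or body[-1] != "end-port-group":
        raise ValueError("expected port-group ... end-port-group")
    for line in body[1:-1]:
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if not sep or key not in group:
            raise ValueError(f"unexpected line: {line!r}")
        if key == "port-guid":
            # base 16 accepts an optional '0x' prefix
            group[key] += [int(v.strip(), 16) for v in value.split(",")]
        elif key == "node-type":
            group[key] += [v.strip() for v in value.split(",")]
        else:
            group[key] = value
    return group
```

With this layout a group is simply the union of everything listed in it, so
GUIDs and node types can be mixed freely.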

>>      qos-setup
>>
>>          # define all types of VLArb tables. The length of the tables should
>>          # match the physically supported tables by their target ports
>>          vlarb-tables
>>              # scope defines the exact ports the VLArb tables apply to
>>              vlarb-scope
>>                  # defining VLArb tables on all the ports that belong to
>>                  # port group 'Storage', and on all the ports connected
>>                  # to ports of port group 'Storage'
>>                  group: Storage
> 
> So "group" is only for ports that belong to 'Storage'?

Yes, and "across" is for ports that are connected to ports of group 'Storage'
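
In other words, roughly (a sketch in Python; the data layout and names are
assumptions made only for this illustration): "group" contributes the members
themselves, "across" contributes the link peers of the members, and a scope
that uses both gets the union.

```python
def scope_ports(groups, links, group=None, across=None):
    """Resolve a scope to a set of ports.

    groups: port-group name -> set of port ids
    links:  port id -> id of the port at the other end of the link
    """
    ports = set()
    if group is not None:
        ports |= groups[group]                      # the members themselves
    if across is not None:
        ports |= {links[p] for p in groups[across] if p in links}  # their peers
    return ports
```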

>>                  # "across" means all the ports that are connected to ports
>>                  # that belong to the specified port group
>>                  across: Storage
>>                  # VLArb table holds VL and weight pairs
>>                  vlarb-high: 0:255,1:127,2:63,3:31,4:15,5:7,6:3,7:1
>>                  vlarb-low: 8:255,9:127,10:63,11:31,12:15,13:7,14:3
>>                  vl-high-limit: 10
>>              end-vlarb-scope
>>              # There can be several scopes
>>          end-vlarb-tables
>>
>>          sl2vl-tables
>>              # Scope defines the exact devices and in/out ports tables apply
>>              # to.
>>              # Note: if the same port is matching several rules the *FIRST*
>>              # one applies.
>>              sl2vl-scope
>>                  # SL2VL tables are organized as SL2VL(in-port,out-port)
>>                  # "from: n,m" means we define the SL2VL(n,*) and SL2VL(m,*)
>>                  # "to: n,m" means we define the SL2VL(*,n) and SL2VL(*,m)
>>                  #
>>                  # The following example specifies that all the SL2VL tables
>>                  # entries should be defined for all the ports of group
>>                  # Part1:
>>                  group: Part1
>>                  from: *
>>                  to: *
>>                  # SL2VL table has to have 16 values at max - one for each
>>                  # SL.
>>                  # If the user specifies less than 16 values, all the missing
>>                  # VL values will be implicitly set to 0
>>                  sl2vl-table: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
>>              end-sl2vl-scope
>>
>>              sl2vl-scope
>>                  # "across-to" is a combination of "across" keyword
>>                  # (definition can be found in VLArb tables section) and
>>                  # "to" keyword.
>>                  # "across: PortGroupName" refers to all the ports that are
>>                  # connected to ports that belong to PortGroupName.
>>                  #
>>                  # Example of "across-to" usage:
>>                  #   A user has a set of 'special' nodes (e.g. storage
>>                  #   nodes), and all the traffic to these nodes has to get
>>                  #   specific VL. The solution is to define port group
>>                  #   (e.g. "Storage") that will include all the ports of
>>                  #   these nodes, and then to configure SL2VL tables on all
>>                  #   the switch ports that are connected to the Storage
>>                  #   port group by specifying "across-to: Storage".
>>                  #
>>                  across-to: Storage2
>>                  # Similar to "across-to", "across-from" is a combination of
>>                  # "across" and "from" keywords
>>                  across-from: Storage1
>>                  sl2vl-table: 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0
>>              end-sl2vl-scope
>>          end-sl2vl-tables
>>
>>      end-qos-setup
>>
>>
>>      qos-levels
>>
>>          # the first one is just setting SL
>>          qos-level
>>              use: for the lowest priority communication
>>              sl: 15
>>              packet-life: 16
>>          end-qos-level
>>          # the second sets SL and QoS Class
>>          qos-level
>>              use: low latency best bandwidth
>>              sl: 0
>>          end-qos-level
>>          # the whole set: SL, MTU-Limit, Rate-Limit, Packet Lifetime, Path
>>          # Bits
>>          qos-level
>>              use: just an example
>>              sl: 0
>>              mtu-limit: 1
>>              rate-limit: 1
>>              packet-life: 12
>>              # Path Bits can be used e.g. to provide different routes
>>              # through the subnet to a particular port
>>              path-bits: 2,4,8-32
>>          end-qos-level
>>
>>      end-qos-levels
>>
>>
>>      # Match rules are scanned in a first-fit manner (like firewall rules
>>      # table)
>>      qos-match-rules
>>
>>          # matching by single criteria: class (list of values and ranges)
>>          qos-match-rule
>>              # just a description
>>              use: low latency by class 7-9 or 11
>>              qos-class: 7-9,11
>>              # number of qos-level to apply to the matching PR/MPR
>>              qos-level-sn: 1
> 
> Isn't it better and less error prone to match qos_level by name and not
> by sequential number?

qos-level can have a name, and then qos-match-rule will refer to this name.
But matching qos-level by sequential number makes it really easy to locate
the referred qos-level, which is important, as every PR/MPR request would
go through this process, so saving some runtime in this area is important IMHO.
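
The two are not mutually exclusive, though. As a sketch (Python for brevity;
the function name is made up and none of this is implemented), names could be
resolved to sequential numbers once, when the policy file is parsed, so that
the per-request path still only does an array index:

```python
def resolve_level_refs(level_names, rule_refs):
    """Turn per-rule qos-level names into sequential numbers at parse time.

    level_names: qos-level names in definition order (position = number)
    rule_refs:   the qos-level name each match rule refers to, in rule order
    """
    index = {name: sn for sn, name in enumerate(level_names)}
    resolved = []
    for ref in rule_refs:
        if ref not in index:
            raise ValueError(f"match rule refers to unknown qos-level {ref!r}")
        resolved.append(index[ref])
    return resolved
```

The numbers here simply follow definition order starting from 0, which is an
assumption of the sketch. A dangling reference is caught at parse time rather
than on the first PR/MPR that hits the rule.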

>>          end-qos-match-rule
>>          # show matching by destination group AND service-ids
>>          qos-match-rule
>>              use: Storage targets connection
>>              destination: Storage
>>              service-id: 22,4719-5000
>>              qos-level-sn: 2
>>          end-qos-match-rule
>>          # show matching by source group only
>>          qos-match-rule
>>              use: bla bla
>>              source: Storage
>>              qos-level-sn: 3
>>          end-qos-match-rule
>>
>>      end-qos-match-rules
>>
>>
>>  4. IPoIB
>>  ---------
>>
>>  IPoIB already queries the SA for its broadcast group information. The
>>  additional functionality required is for IPoIB to provide the broadcast
>>  group SL, MTU, and RATE in every following PathRecord query performed when
>>  a new UDAV is needed by IPoIB.
>>  We could assign a special Service-ID for IPoIB use but since all
>>  communication on the same IPoIB interface shares the same QoS-Level without
>>  the ability to differentiate it by target service we can ignore it for
>>  simplicity.
>>
>>  5. CMA features
>>  ----------------
>>
>>  The CMA interface supports Service-ID through the notion of port space as a
>>  prefix to the port_num which is part of the sockaddr provided to
>>  rdma_resolve_addr(). What is missing is the explicit request for a
>>  QoS-Class that should allow the ULP (like SDP) to propagate a specific
>>  request for a class of service. A mechanism for providing the QoS-Class is
>>  available in the IPv6 address, so we could use that address field. Another
>>  option is to implement a special connection options API for CMA.
>>
>>  Missing functionality by CMA is the usage of the provided QoS-Class and
>>  Service-ID in the sent PR/MPR. When a response is obtained it is an existing
>>  requirement for the CMA to use the PR/MPR from the response in setting up
>>  the QP address vector.
>>
>>
>>  6. SDP
>>  -------
>>
>>  SDP uses CMA for building its connections.
>>  The Service-ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits
>>  holding the remote TCP/IP Port Number to connect to.
>>  SDP might be provided with SO_PRIORITY socket option. In that case the value
>>  provided should be sent to the CMA as the TClass option of that connection.
>>
>>  7. SRP
>>  -------
>>
>>  Current SRP implementation uses its own CM callbacks (not CMA). So SRP
>>  should fill in the Service-ID in the PR/MPR by itself and use that
>>  information in setting up the QP. The T10 SRP standard defines the SRP
>>  Service-ID to be defined by the SRP target I/O Controller (but they should
>>  also comply with IBTA Service-ID rules). Anyway, the Service-ID is reported
>>  by the I/O Controller in the ServiceEntries DM attribute and should be used
>>  in the PR/MPR if the SA reports its ability to handle QoS PR/MPRs.
>>
>>  8. iSER
>>  --------
>>  iSER uses CMA and thus should be very close to SDP. The Service-ID for iSER
>>  should be TBD.
>>
>>
>>  9. OpenSM features
>>  -------------------
>>  The QoS related functionality to be provided by OpenSM can be split into two
>>  main parts:
>>
>>  9.1. Fabric Setup
>>  During fabric initialization the SM should parse the policy and apply its
>>  settings to the discovered fabric elements. The following actions should be
>>  performed:
>>  * Parsing of policy
>>  * Node Group identification. Warning should be provided for each node not
>>    specified but found.
>>  * SL2VL settings validation should be checked:
>>    + A warning will be provided if there are no matching targets for the
>>      SL2VL setting statement.
>>    + An error message will be printed to the log file if an invalid setting
>>      is found. A setting is invalid if it refers to:
>>      - Non existing port numbers of the target devices
>>      - Unsupported VLs for the target device. In the latter case the map to
>>        non existing VLs should be replaced with VL15 i.e. packets will be
>>        dropped.
> 
> I'm not sure it is optimal. We could have well documented or even
> configurable mapping rule instead, then this will not limit devices with
> higher capabilities.

I'm open to suggestions.

>>  * SL2VL setting is to be performed
>>  * VL Arbitration table settings should be validated according to the
>>    following rules:
>>    + A warning will be provided if there are no matching targets for the
>>      setting statement
>>    + An error will be provided if the port number exceeds the target ports
>>    + An error will be generated if the table length exceeds device
>>      capabilities
> 
> Ditto.
> 
>>    + A warning will be generated if the table quote a VL that is not supported
>>      by the target device
> 
> What is "table quote" here?
>>  * VL Arbitration tables will be set on the appropriate targets
>>
>>  9.2. PR/MPR query handling:
>>  OpenSM should be able to enforce the provided policy on client request.
>>  The overall flow for such requests is: first the request is matched against
>>  the defined match rules such that the target QoS-Level definition is found.
>>  Given the QoS-Level a path(s) search is performed with the given
>>  restrictions imposed by that level. The following two sections describe
>>  these steps.
>>
>>  How Service-ID is carried in the PathRecord and MultiPathRecord attributes
>>  is now standardized by the IBTA.
>>
>>
>>  9.2.1. Matching rule search:
>>  A rule is "matching" a PR/MPR request using the following criteria:
>>  * Matching rules provide values in a list of either single value, or range
>>    of values. A PR/MPR field is "matching" the rule field if it is explicitly
>>    noted in the list of values or is one of the values covered by a range
>>    included in the field values list.
>>  * Only PR/MPR fields that have their component mask bit set should be
>>    compared.
>>  * For a rule to be "matching" a PR/MPR request all the rule fields should be
>>    "matching" their PR/MPR fields. Consequently, a PR/MPR request that does
>>    not have a component mask field set for one of the rule defined fields
>>    can not match that rule.
>>  * A PR/MPR request that has a component mask bit set for one of the fields
>>    that is not defined by the rule can match the rule.
> 
> Aren't the last two too restrictive? The SA can just filter out paths in
> the response to match the rest of the rule. No?

Not sure I'm following.
The last bullet is not restrictive at all - it says that if you have a match
rule with some reduced set of fields (e.g. only service id), any PR/MPR with
a matching service id will be matched, even if it also has MTU, rate, etc.
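
To spell out the two bullets, a minimal sketch (Python for brevity; the names
and the dictionary representation are assumptions made for this illustration,
with a field missing from the request standing for a cleared component mask
bit, and ranges written as (low, high) tuples):

```python
def value_in_list(value, allowed):
    """True if value equals a single entry or falls inside a (low, high) range."""
    for entry in allowed:
        low, high = entry if isinstance(entry, tuple) else (entry, entry)
        if low <= value <= high:
            return True
    return False


def rule_matches(rule, request):
    """rule: field -> list of values/ranges; request: field -> value.

    A field absent from `request` stands for a cleared component mask bit.
    """
    for field, allowed in rule.items():
        if field not in request:      # mask bit not set: rule cannot match
            return False
        if not value_in_list(request[field], allowed):
            return False
    return True                       # extra request fields are ignored
```

So a rule with only "service-id: 22,4719-5000" matches a request carrying
Service-ID 4800 together with MTU and rate, but does not match a request that
has no Service-ID at all.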

>>  The algorithm to be used for searching for a rule match might be as simple
>>  as a sequential search through all rules or enhanced for better performance.
>>  The semantics of every rule field and its matching PR/MPR field are
>>  described below:
>>  * Source: the SGID or SLID should be part of this group
>>  * Destination: the DGID or DLID should be part of this group
>>  * Service-ID: check if the requested Service-ID (available in the PR/MPR old
>>    SM-Key field) is matching any of this rule Service-IDs
>>  * TClass: check if the PR/MPR TClass field is matching
>>
>>  9.2.2. PR/MPR response generation:
>>  The QoS-Level pointed to by the first rule that matches the PR/MPR request
>>  should be used for obtaining the response SL, MTU-Limit, RATE-Limit,
>>  Path-Bits and QoS-Class. A default QoS-Level should be used if no rule is
>>  matching the query.
> 
> Where should this default be defined?

OK, I missed that part. Here it is:

  - qos-level sequential number is counted from 0
  - qos-level num. 0 is mandatory and is treated as the Default Level - it's
    applied to any PR/MPR request that didn't match any match rule
  - default qos-level can be also referred explicitly in any match rule
    by specifying "qos-level-sn: 0"
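
As a sketch of the resulting lookup (Python for brevity; the function name
and the representation of rules as (predicate, number) pairs are assumptions
made only for this illustration): the rules are scanned in file order, the
first match wins, and level 0 is the fallback.

```python
def select_qos_level(match_rules, qos_levels, request):
    """Return the qos-level for a request: first matching rule wins, else level 0.

    match_rules: list of (predicate, qos_level_sn) in policy-file order
    qos_levels:  list of level definitions; index 0 is the default level
    """
    if not qos_levels:
        raise ValueError("policy must define qos-level 0 (the default level)")
    for predicate, sn in match_rules:
        if predicate(request):
            return qos_levels[sn]
    return qos_levels[0]
```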

-- Yevgeny

> Sasha
> 
> 
>>  The efficient algorithm for finding paths that meet the QoS-Level criteria
>>  is beyond the scope of this RFC and left for the implementer to provide.
>>  However the criteria by which the paths match the QoS-Level are described
>>  below:
>>
>>  * SL: The paths found should all use the given SL. For that sake PR/MPR
>>    algorithm should traverse the path from source to destination only through
>>    ports that carry a valid VL (not VL15) by the SL2VL map (should consider
>>    input and output ports and SL).
>>  * MTU-Limit: The resulting paths MTU should not exceed the given MTU-Limit
>>  * Rate-Limit: The resulting paths RATE should not exceed the given
>>    RATE-Limit (rate limit is given in units of link BW = Width*Speed
>>    according to IBTA Specification Vol-1 table-205 p-901 l-24).
>>  * Path-Bits: define the target LID lowest bits (number of bits defined by
>>    the target port PortInfo.LMC field). The path should traverse the LFT
>>    using the target port LID with the path-bits set.
>>  * QoS-Class: should be returned in the result PR/MPR. When routing is going
>>    to be supported by OpenSM we might use this field in selecting the target
>>    router too in a TBD way.
>>
> 




