[openib-general] QoS RFC - Resend using a friendly mailer

Eitan Zahavi eitan at mtlpx01.yok.mtl.com
Tue May 30 07:53:29 PDT 2006


To: OPENIB <openib-general at openib.org>
Subject: QoS RFC - Resend using a friendly mailer
Hi All 

Please find the attached RFC describing how QoS policy support could be implemented in the OpenFabrics stack.
Your comments are welcome.

Eitan

              RFC: OpenFabrics Enhancements for QoS Support
             ===============================================

Authors:   Eitan Zahavi <eitan at mellanox.co.il>
Date:      May 2006
Revision:  0.1

Table of contents:
1. Overview
2. Architecture
3. Supported Policy
4. IPoIB functionality
5. CMA functionality
6. SDP functionality
7. SRP functionality
8. iSER functionality
9. OpenSM functionality

1. Overview
------------
Quality of Service requirements stem from the realization of I/O consolidation 
over an IB network: as multiple applications and ULPs share the same fabric, a 
means to control their use of the network resources becomes a must. The basic 
need is to differentiate the service levels provided to different traffic flows, 
so that a policy can be enforced to control each flow's utilization of the 
fabric resources.

The IBTA specification defines several hardware features and management 
interfaces to support QoS:
* Up to 15 Virtual Lanes (VLs) carry traffic in a non-blocking manner
* Arbitration between traffic of different VLs is performed by a weighted 
  round robin arbiter with two priority levels. The arbiter is programmed with 
  a sequence of (VL, weight) pairs and a maximal number of high priority credits 
  to be processed before low priority is served
* Packets carry a class of service marking in the range 0 to 15 in their
  header SL field
* Each switch can map an incoming packet by its SL to a particular output
  VL, based on a programmable table: VL = SL-to-VL-MAP(in-port, out-port, SL)
* The Subnet Administrator controls the parameters of each communication flow
  by providing them in response to Path Record queries
The IB QoS features provide the means to implement a DiffServ-like architecture. 
The DiffServ architecture (IETF RFC 2474 and RFC 2475) is widely used today in 
highly dynamic fabrics. 

This proposal provides the detailed functional definition for the various 
software elements that are required to enable a DiffServ-like architecture over 
the OpenFabrics software stack.





2. Architecture
----------------
This proposal splits the QoS functionality between the SM/SA, the CMA and the 
various ULPs. We take a chronological approach to describe how the overall 
system works:

2.1. The network manager (human) provides a set of rules (policy) that defines 
how the network is configured and how its resources are split between different 
QoS-Levels. The policy also defines how to decide which QoS-Level each 
application, ULP or service uses.

2.2. The SM analyzes the provided policy to see if it is realizable and performs 
the necessary fabric setup. The SM may continuously monitor the policy and adapt 
to changes in it. Part of this policy defines the default QoS-Level of each 
partition. The SA is enhanced to match the requested Source, Destination, 
TClass and Service-ID (and optionally SL and priority) against the policy, so 
that clients (ULPs, programs) can obtain a policy-enforced QoS. The SM is also 
enhanced to support setting up partitions with an appropriate IPoIB broadcast 
group. This broadcast group carries its QoS attributes: TClass, SL, MTU and 
RATE.

2.3. IPoIB is set up. IPoIB uses the SL, MTU and RATE available on the 
multicast group which forms the broadcast group of this partition.

2.4. MPI, which provides non-IB-based connection management, should be 
configured to run using hard-coded SLs. It uses these SLs in every QP it opens.

2.5. ULPs that use the CM interface (like SRP) should have their own 
pre-assigned Service-ID and use it while obtaining a PathRecord for establishing 
their connections. The SA receiving the PathRecord query should match it against 
the policy and return the appropriate PathRecord including SL, MTU, RATE and 
TClass. 

2.6. ULPs and programs using the CMA to establish an RC connection should 
provide the CMA with the target IP and Service-ID. Some of the ULPs might also 
provide a TClass (e.g. for SDP sockets that are provided the TOS socket option). 
The CMA should then use the provided Service-ID and optional TClass and pass 
them in the PathRecord request. The resulting PathRecord should be used for 
configuring the connection QP.

PathRecord and MultiPathRecord enhancement for QoS: 
As mentioned above, the PathRecord and MultiPathRecord attributes should be 
enhanced to carry the Service-ID, which is a 64-bit value. Given the existing 
definition of these attributes we propose to use the following fields for the 
Service-ID:
* For PathRecord: use the first 2 reserved fields, which are 32 bits each 
  (component mask bits 0x1 and 0x2). Component mask bit 0x1 should be used to 
  refer to the merged Service-ID field
* For MultiPathRecord: use 2 reserved fields: 
  1. after the packet life (8 bits) which is component mask bit 0x10000 (17)
  2. the field before SDGID1 (56 bits) which is component mask bit 0x200000 (22)
  Once merged they should be selected using component mask bit 0x10000 (17)
A new capability bit in the SA class port info should describe the SM QoS 
support. This approach provides an easy migration path for the existing access 
layer and ULPs by not introducing a new attribute.


3. Supported Policy
-------------------- 

The QoS policy supported by this proposal is divided into 4 subsections:

* Node Group: a set of HCAs, Routers or Switches that share the same settings. 
A node group might be a partition defined by the partition manager policy in 
terms of GUIDs. Future implementations might provide support for NodeDescription 
based definition of node groups.

* Fabric Setup: 
Defines how the SL2VL and VLArb tables should be setup. This policy definition 
assumes the computation of target behavior should be performed outside of 
OpenSM.

* QoS-Levels Definition:
This section defines the possible sets of parameters for QoS that a client might 
be mapped to. Each set holds: SL and optionally: Max MTU, Max Rate, Path Bits 
(in case LMC > 0 is used for QoS) and TClass.

* Matching Rules:
A list of rules that match an incoming PathRecord request to a QoS-Level. The 
rules are processed in order, so that the first match is applied. Each rule is 
built out of a set of match expressions, which should all match for the rule to 
apply. The matching expressions are defined for the following fields:
** SRC and DST to lists of node groups
** Service-ID to a list of Service-ID or Service-ID ranges
** TClass to a list of TClass values or ranges

XML style syntax is provided for the policy file. However, a strict BNF format 
(provided in section 8) should be used for parsing it.

<?xml version="1.0" encoding="ISO-8859-1"?>
<qos-policy>
 <!-- Port Groups define sets of ports to be used later in the settings -->
 <port-groups>
  <!-- using port GUIDs -->
  <port-group> <name>Storage</name> <use>our SRP storage targets</use>
   <port-guid>0x1000000000000001</port-guid>
   <port-guid>0x1000000000000002</port-guid>
  </port-group>
  <!-- using names obtained by concatenation of first 2 words of NodeDescription
    and port number -->
  <port-group> <name>Virtual Servers</name> <use>node desc and IB port #</use>
   <port-name>vs1/HCA-1/P1</port-name>
   <port-name>vs3/HCA-1/P1</port-name>
   <port-name>vs3/HCA-2/P1</port-name>
  </port-group>
  <!-- using partitions defined in the partition policy -->
  <port-group> <name>Partition 1</name> <use>default settings</use>
   <partition>Part1</partition> 
  </port-group>
  <!-- using node types HCA|ROUTER|SWITCH -->
  <port-group> <name>Routers</name> <use>all routers</use>
   <node-type>ROUTER</node-type> 
  </port-group>  
 </port-groups>
 <qos-setup>
  <!-- define all types of SL2VL tables always have 16 VL entries -->
  <sl2vl-tables>
   <!-- scope defines the exact devices and in/out ports the tables apply to
    if the same port is matching several rules the last one applies -->
   <sl2vl-scope> <group>Part1</group> <from>*</from> <to>*</to> 
     <sl2vl-table>0,1,2,3,4,5,6,7,8,9,10,11,12,13,14</sl2vl-table>
   </sl2vl-scope>
   <!-- "across" means the port just connected to the given group, 
     also the link across port 1 is probably supporting only 2 VLs -->
   <sl2vl-scope> <across>Storage</across> <from>*</from> <to>1</to>
     <sl2vl-table>0,1,1,1,1,1,1,1,1,1,1,1,1,1,1</sl2vl-table>
   </sl2vl-scope>
  </sl2vl-tables>

  <!-- define all types of VLArb tables. The length of the tables should 
   match the physically supported tables by their target ports -->
  <vlarb-tables>
   <!-- scope defines the exact ports the VLArb tables apply to -->
   <vlarb-scope> <group>Storage</group> <to>*</to>
     <!-- VLArb table holds VL and weight pairs -->
     <vlarb-high>0:255,1:127,2:63,3:31,4:15,5:7,6:3,7:1</vlarb-high>
     <vlarb-low>8:255,9:127,10:63,11:31,12:15,13:7,14:3</vlarb-low>
     <vl-high-limit>10</vl-high-limit>
   </vlarb-scope>
  </vlarb-tables>
 </qos-setup>

 <qos-levels>
  <!-- the first one is just setting SL -->
  <qos-level> <sn>1</sn> <use>for the lowest priority comm</use>
    <sl>15</sl>
  </qos-level>
  <!-- the second sets SL and TClass -->
  <qos-level> <sn>2</sn> <use>low latency best bandwidth</use>
    <sl>0</sl> <tclass>7</tclass>
  </qos-level>
  <!-- the whole set: SL, TClass, MTU-Limit, Rate-Limit, Path-Bits  -->
  <qos-level> <sn>3</sn> <use>just an example</use>
    <sl>0</sl> <tclass>32</tclass> <mtu_limit>1</mtu_limit> 
    <rate_limit>1</rate_limit>
  </qos-level>
 </qos-levels>

 <qos_match_rules>
  <!-- matching by single criteria: tclass (list of values and ranges) -->
  <qos_match_rule> <sn>1</sn> <use>low latency by tclass 7-9 or 11</use>
   <tclass>7-9,11</tclass> <match-level>1</match-level>
  </qos_match_rule>
  <!-- show matching by destination group AND service-ids -->
  <qos_match_rule> <sn>2</sn> <use>Storage targets connection</use>
   <destination>Storage</destination> <service>22,4719</service>
   <match-level>3</match-level>
  </qos_match_rule>
 </qos_match_rules>

</qos-policy>
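
The match rules above give their values as comma separated lists of single
values and ranges (for example "7-9,11" or "22,4719"). A minimal Python sketch
of parsing such a list could look as follows. The helper names are
hypothetical, and accepting hexadecimal values with a 0x prefix is an
assumption, not something the policy syntax above specifies.

```python
def parse_value_list(text):
    """Turn a string such as "7-9,11" into a list of inclusive (low, high) pairs."""
    ranges = []
    for item in text.split(","):
        item = item.strip()
        if "-" in item:
            low, high = (int(part, 0) for part in item.split("-", 1))
        else:
            low = high = int(item, 0)
        if low > high:
            raise ValueError("empty range: %r" % item)
        ranges.append((low, high))
    return ranges


def in_value_list(value, ranges):
    """True if value is listed explicitly or covered by one of the ranges."""
    return any(low <= value <= high for low, high in ranges)


tclass_ranges = parse_value_list("7-9,11")
service_ranges = parse_value_list("22,4719")
```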


4. IPoIB
---------

IPoIB already queries the SA for its broadcast group information. The additional 
functionality required is for IPoIB to provide the broadcast group SL, MTU, RATE 
and TClass in every subsequent PathRecord query performed when a new UDAV is 
needed by IPoIB. 
We could assign a special Service-ID for IPoIB use, but since all communication 
on the same IPoIB interface shares the same QoS-Level without the ability to 
differentiate it by target service, we can ignore it for simplicity.

5. CMA features
----------------

The CMA interface supports Service-ID through the notion of port space as a 
prefix to the port_num which is part of the sockaddr provided to 
rdma_resolve_addr(). What is missing is an explicit request for a TClass, which 
would allow the ULP (like SDP) to propagate a specific request for a class of 
service. A mechanism for providing the TClass is available in the IPv6 address, 
so we could use that address field. Another option is to implement a special 
connection options API for CMA.

Also missing from the CMA is the use of the provided TClass and Service-ID in 
the PathRecord query it sends. Once a response is obtained, it is an existing 
requirement for the CMA to use the PathRecord from the response in setting up 
the QP address vector.
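
If the IPv6 "address field" mentioned above is taken to be the flow information
word of the IPv6 socket address (sin6_flowinfo; this is an assumption, the text
does not name the field), the TClass occupies the 8 bits just above the 20-bit
flow label. The Python sketch below works on a host byte order value; the
function names are hypothetical.

```python
FLOW_LABEL_BITS = 20


def flowinfo_with_tclass(tclass, flow_label=0):
    """Build a host-order flow information word from TClass and flow label."""
    if not 0 <= tclass <= 0xFF:
        raise ValueError("TClass is an 8-bit value")
    if not 0 <= flow_label < (1 << FLOW_LABEL_BITS):
        raise ValueError("flow label is a 20-bit value")
    return (tclass << FLOW_LABEL_BITS) | flow_label


def tclass_from_flowinfo(flowinfo):
    """Extract the 8-bit TClass from a host-order flow information word."""
    return (flowinfo >> FLOW_LABEL_BITS) & 0xFF


flowinfo = flowinfo_with_tclass(7, flow_label=0x12345)
```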


6. SDP
-------

SDP uses CMA for building its connections. 
The Service-ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits 
holding the remote TCP/IP Port Number to connect to.
SDP might be provided with the SO_PRIORITY socket option. In that case the value 
provided should be passed to the CMA as the TClass option of that connection. 
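
The SDP Service-ID layout given above (0x000000000001PPPP) can be computed
from the TCP/IP port number as in this small Python sketch. The constant and
function names are hypothetical.

```python
SDP_SERVICE_ID_PREFIX = 0x0000000000010000


def sdp_service_id(tcp_port):
    """Return the SDP Service-ID for a remote TCP/IP port number."""
    if not 0 <= tcp_port <= 0xFFFF:
        raise ValueError("port number must fit in 16 bits")
    return SDP_SERVICE_ID_PREFIX | tcp_port


# Port 22 (0x0016) as a 64-bit hex string.
ssh_service_id = format(sdp_service_id(22), "#018x")
```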

7. SRP
-------

The current SRP implementation uses its own CM callbacks (not CMA). So SRP 
should fill in the Service-ID in the PathRecord by itself and use that 
information in setting up the QP. The T10 SRP standard leaves the SRP Service-ID 
to be defined by the SRP target I/O Controller (though it should also comply 
with the IBTA Service-ID rules). In any case, the Service-ID is reported by the 
I/O Controller in the ServiceEntries DM attribute and should be used in the 
PathRecord if the SA reports its ability to handle QoS PathRecords.

8. iSER
--------
iSER uses CMA and thus should be very close to SDP. The Service-ID for iSER 
is TBD.


9. OpenSM features
-------------------
The QoS related functionality to be provided by OpenSM can be split into two 
main parts:

9.1. Fabric Setup
During fabric initialization the SM should parse the policy and apply its 
settings to the discovered fabric elements. The following actions should be 
performed:
* Parsing of policy
* Node Group identification. A warning should be provided for each node found 
  but not specified.
* SL2VL settings should be validated:
  + A warning will be provided if there are no matching targets for the SL2VL 
    setting statement. 
  + An error message will be printed to the log file if an invalid setting is 
    found. A setting is invalid if it refers to:
    - Non-existing port numbers of the target devices
    - Unsupported VLs for the target device. In the latter case the mapping to 
      non-existing VLs should be replaced with VL15, i.e. packets will be 
      dropped.
* SL2VL setting is to be performed
* VL Arbitration table settings should be validated according to the following 
  rules:
  + A warning will be provided if there are no matching targets for the setting 
    statement
  + An error will be provided if the port number exceeds the number of ports of 
    the target
  + An error will be generated if the table length exceeds device capabilities
  + A warning will be generated if the table references a VL that is not 
    supported by the target device
* VL Arbitration tables will be set on the appropriate targets
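
The SL2VL validation step listed above might look roughly like the Python
sketch below. It is not OpenSM code: the function name, its parameters and the
message texts are hypothetical, and treating port 0 up to num_ports as valid
port numbers is an assumption.

```python
VL15 = 15  # per the rule above, mapping a data SL here drops its packets


def validate_sl2vl(table, out_port, num_ports, supported_vls):
    """Validate one SL2VL table against a target device.

    Returns (sanitized_table, messages). sanitized_table is None when the
    statement refers to a port the device does not have.
    """
    messages = []
    if out_port != "*" and not 0 <= out_port <= num_ports:
        messages.append("ERROR: port %s does not exist on target" % out_port)
        return None, messages

    sanitized = []
    for sl, vl in enumerate(table):
        if vl != VL15 and vl >= supported_vls:
            messages.append(
                "ERROR: SL %d maps to unsupported VL %d, using VL15" % (sl, vl))
            vl = VL15
        sanitized.append(vl)
    return sanitized, messages


# A device with 8 ports that supports only two data VLs (VL0 and VL1).
fixed_table, log = validate_sl2vl([0, 1, 2, 3], "*", num_ports=8, supported_vls=2)
```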

9.2. PathRecord query handling:
OpenSM should be able to enforce the provided policy on client requests.
The overall flow for such requests is: first the request is matched against the 
defined match rules so that the target QoS-Level definition is found. Given 
the QoS-Level, a search for path(s) is performed under the restrictions imposed 
by that level. The following two sections describe these steps.

One issue not standardized by the IBTA is how the Service-ID is carried in the 
PathRecord and MultiPathRecord attributes. There are basically two options:
a. Replace the SM-Key field by the Service-ID. In that case no component mask
   bit will be assigned to it, so if the field is zero we should treat it 
   as if the component mask bit is clear. 
b. Encode it into spare fields. For PathRecord the first two fields are reserved 
   and are 64 bits when combined. The first component mask bit maps to the first 
   reserved field and should be used for Service-ID masking. For the 
   MultiPathRecord attribute there are no adjacent reserved fields that make up 
   a 64-bit field. So the reserved field following the packet-lifetime (8 bits) 
   combined with the reserved field adjacent to DGIDCount (56 bits) can make up 
   the Service-ID. In this case too, the component mask bit of the first 
   reserved field should be used as the Service-ID component mask bit.



9.2.1. Matching rule search:
A rule is "matching" a PathRecord request using the following criteria:
* Matching rules provide values as a list of single values or ranges of 
  values. A PathRecord field is "matching" the rule field if it is explicitly 
  listed in the list of values or is covered by a range included in the list.
* Only PathRecord fields that have their component mask bit set should be
  compared. 
* For a rule to be "matching" a PathRecord request, all the rule fields should 
  be "matching" their PathRecord fields. Thus a PathRecord request that does 
  not have the component mask bit set for one of the fields defined by the rule 
  cannot match that rule.
* A PathRecord request that has a component mask bit set for a field that is 
  not defined by the rule can still match the rule. 
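
The criteria above can be sketched in Python as follows. This is an
illustration only: a request is modeled as a dictionary that holds just the
fields whose component mask bit is set, a rule as a dictionary from field name
to a list of inclusive (low, high) ranges, and all names are hypothetical.

```python
def rule_matches(rule, request):
    """Apply the matching criteria to one rule and one PathRecord request."""
    for field, ranges in rule.items():
        if field not in request:
            # Component mask bit is clear for a field the rule defines.
            return False
        value = request[field]
        if not any(low <= value <= high for low, high in ranges):
            return False
    # Request fields the rule does not define are simply ignored.
    return True


tclass_rule = {"tclass": [(7, 9), (11, 11)]}
```

A request carrying both a TClass of 8 and a Service-ID still matches
tclass_rule, while a request without a TClass does not.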

The algorithm used for searching for a rule match might be as simple as a 
sequential search through all rules, or enhanced for better performance. The 
semantics of every rule field and its matching PathRecord field are described 
below:
* Source: the SGID or SLID should be part of this group
* Destination: the DGID or DLID should be part of this group
* Service-ID: check if the requested Service-ID (available in the PathRecord old 
  SM-Key field) matches any of this rule's Service-IDs
* TClass: check if the PathRecord TClass field matches

9.2.2. PathRecord response generation:
The QoS-Level pointed to by the first rule that matches the PathRecord request 
should be used for obtaining the response SL, MTU-Limit, RATE-Limit, Path-Bits 
and TClass. A default QoS-Level should be used if no rule matches the query.
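
A sequential first-match search with a default QoS-Level can be sketched as
below. The rules are modeled as (predicate, QoS-Level number) pairs, loosely
following the two match rules of the example policy in section 3. The function
name, the request dictionaries and the default level 0 are hypothetical.

```python
def select_qos_level(match_rules, request, default_level):
    """Return the QoS-Level of the first matching rule, else the default."""
    for matches, level in match_rules:
        if matches(request):
            return level
    return default_level


match_rules = [
    # Rule 1: TClass 7-9 or 11 maps to QoS-Level 1.
    (lambda req: req.get("tclass") in (7, 8, 9, 11), 1),
    # Rule 2: destination group Storage and Service-ID 22 or 4719 maps to 3.
    (lambda req: req.get("destination") == "Storage"
        and req.get("service_id") in (22, 4719), 3),
]

both = {"tclass": 8, "destination": "Storage", "service_id": 22}
storage_only = {"destination": "Storage", "service_id": 4719}
```

The request named both satisfies the two rules, and the rule order decides
that QoS-Level 1 is used.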

An efficient algorithm for finding paths that meet the QoS-Level criteria is 
beyond the scope of this RFC and is left for the implementer to provide. However, 
the criteria by which the paths match the QoS-Level are described below:

* SL: The paths found should all use the given SL. To that end the PathRecord 
  algorithm should traverse the path from source to destination only through 
  ports that carry a valid VL (not VL15) by the SL2VL map (considering input 
  and output ports and SL). 
* MTU-Limit: The MTU of the resulting paths should not exceed the given 
  MTU-Limit
* Rate-Limit: The RATE of the resulting paths should not exceed the given 
  RATE-Limit (rate limit is given in units of link BW = Width*Speed according 
  to IBTA Specification Vol-1 table-205 p-901 l-24).
* Path-Bits: define the lowest bits of the target LID (the number of bits is 
  defined by the target port PortInfo.LMC field). The path should traverse the 
  LFT using the target port LID with the path-bits set.
* TClass: should be returned in the resulting PathRecord. When routing is 
  supported by OpenSM we might also use this field in selecting the target 
  router, in a TBD way.
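
Two of the criteria above, the SL and the Path-Bits, can be illustrated with
the short Python sketch below. It assumes a path is given as the list of SL2VL
tables met at each hop, one 16-entry list per traversed in-port and out-port
pair, and that the base LID of the target port has its lowest LMC bits clear.
The function names are hypothetical.

```python
VL15 = 15


def path_carries_sl(hop_tables, sl):
    """True if every hop maps the SL to a valid data VL (not VL15)."""
    return all(table[sl] != VL15 for table in hop_tables)


def dlid_with_path_bits(base_lid, lmc, path_bits):
    """Return the target LID with the given path bits set."""
    if not 0 <= path_bits < (1 << lmc):
        raise ValueError("path bits do not fit in LMC bits")
    if base_lid & ((1 << lmc) - 1):
        raise ValueError("base LID must have its lowest LMC bits clear")
    return base_lid | path_bits


# Two hops: the second one maps SL 1 to VL15, so SL 1 cannot use this path.
hops = [[0] * 16, [0, VL15] + [0] * 14]
```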



