[openib-general] QoS RFC - Resend using a friendly mailer

Hal Rosenstock halr at voltaire.com
Tue May 30 10:38:40 PDT 2006


On Tue, 2006-05-30 at 10:53, Eitan Zahavi wrote:
> To: OPENIB <openib-general at openib.org>
> Subject: QoS RFC - Resend using a friendly mailer
> Hi All 
> 
> Please find the attached RFC describing how QoS policy support could be implemented in the OpenFabrics stack.
> Your comments are welcome.

Some initial comments.

> 
> Eitan
> 
>               RFC: OpenFabrics Enhancements for QoS Support
>              ===============================================
> 
> Authors: . Eitan Zahavi <eitan at mellanox.co.il>
> Date: .... May 2006.
> Revision:  0.1
> 
> Table of contents:
> 1. Overview
> 2. Architecture
> 3. Supported Policy
> 4. IPoIB functionality
> 5. CMA functionality
> 6. SDP functionality
> 7. SRP functionality
> 8. iSER functionality
> 9. OpenSM functionality
> 
> 1. Overview
> ------------
> Quality of Service requirements stem from the realization of I/O consolidation 
> over an IB network: as multiple applications and ULPs share the same fabric, a 
> means to control their use of the network resources is becoming a must. The 
> basic need is to differentiate the service levels provided to different traffic 
> flows, so that a policy can be enforced to control each flow's utilization of 
> the fabric resources.
> 
> IBTA specification defined several hardware features and management interfaces 
> to support QoS:
> * Up to 15 Virtual Lanes (VL) could carry traffic in a non-blocking manner
> * Arbitration between traffic of different VL is performed by a 2 priority 
>   levels weighted round robin arbiter. The arbiter is programmable with 
>   a sequence of (VL, weight) pairs and maximal number of high priority credits 
>   to be processed before low priority is served
> * Packets carry class of service marking in the range 0 to 15 in their
>   header SL field
> * Each switch can map the incoming packet by its SL to a particular output
>   VL based on programmable table VL=SL-to-VL-MAP(in-port, out-port, SL)
> * The Subnet Administrator controls each communication flow parameters
>   by providing them as a response to Path Record query
> 
> The IB QoS features provide the means to implement a DiffServ like architecture. 
> DiffServ architecture (IETF RFC2474 2475) is widely used today in highly dynamic 
> fabrics. 

Only certain DSCP code point equivalents are provided by IBA.

> This proposal provides the detailed functional definition for the various 
> software elements that are required to enable a DiffServ like architecture over 
> the OpenFabrics software stack.
> 
> 
> 
> 
> 
> 2. Architecture
> ----------------
> This proposal splits the QoS functionality between the SM/SA, CMA and the various 
> ULPs. We take the "chronology approach" to describe how the overall system 
> works:
> 
> 2.1. The network manager (human) provides a set of rules (policy) that defines 
> how the network is configured and how its resources are split among different 
> QoS-Levels. The policy also defines how to decide which QoS-Level each 
> application, ULP or service uses.

> 2.2. The SM analyzes the provided policy to see if it is realizable and performs 
> the necessary fabric setup. The SM may continuously monitor the policy and adapt 
> to changes in it.

Do you mean monitor the policy or the fabric here ?

>  Part of this policy defines the default QoS-Level of each 
> partition. The SA is being enhanced to match the requested Source, Destination, 
> TClass, Service-ID

Service ID does not apply to many ULPs. Also, how is it known what
ULP/application a particular service ID refers to (other than perhaps
some well known ones) ?

>  (and optionally SL and priority) against the policy. So 
> clients (ULPs, programs) can obtain a policy enforced QoS. The SM is also 
> enhanced to support setting up partitions with appropriate IPoIB broadcast 
> group. This broadcast group carries its QoS attributes: TClass, SL, MTU and 
> RATE.
> 
> 2.3. IPoIB is being setup. IPoIB uses the SL, MTU and RATE available on the 
> multicast group which forms the broadcast group of this partition.
> 
> 2.4. MPI which provides non IB based connection management should be configured 
> to run using hard coded SLs. It uses these SLs in every QP being opened.
> 
> 2.5. ULPs that use CM interface (like SRP) should have their own pre-assigned 
> Service-ID and use it while obtaining PathRecord for establishing their 
> connections. The SA receiving the PathRecord should match it against the policy 
> and return the appropriate PathRecord including SL, MTU, RATE and TClass. 
> 
> 2.6. ULPs and programs using CMA to establish RC connection should provide the 
> CMA the target IP and Service-ID. Some of the ULPs might also provide TClass 
> (E.g. for SDP sockets that are provided the TOS socket option). The CMA should 
> then use the provided Service-ID and optional TClass and pass them in the 
> PathRecord request. The resulting PathRecord should be used for configuring the 
> connection QP.
> 
> PathRecord and MultiPathRecord enhancement for QoS: 
> As mentioned above the PathRecord and MultiPathRecord attributes should be 
> enhanced to carry the Service-ID which is a 64bit value. Given the existing 
> definition for these attributes we propose to use the following fields for 
> Service-ID:
> * For PathRecord: use the first 2 reserved fields which are 32bits each 
>   (component masks 0x1 and 0x2). Component mask 1 should be used to refer to the 
>   merged Service-ID field
> * For MultiPathRecord: use 2 reserved fields: 
>   1. after the packet life (8 bits) which is component mask bit 0x10000 (17)
>   2. the field before SDGID1 (56 bits) which is component mask bit 0x200000 (22)

This is not possible with the existing approved 1.2 erratum changes.

>   Once merged they should be selected using component mask bit 0x10000 (17)
> A new capability bit should describe the SM QoS support in the SA class port 
> info. This approach provides an easy migration path for existing access layer 
> and ULPs by not introducing a new attribute.
> 
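
To make the PathRecord half of this concrete, here is a minimal sketch of
carrying a 64-bit Service-ID in two 32-bit fields. It only illustrates the
arithmetic. The function names are hypothetical, and putting the upper 32 bits
in the first reserved field is my assumption; the RFC does not say which half
goes where.

```python
MASK32 = 0xFFFFFFFF

def split_service_id(service_id):
    """Split a 64-bit Service-ID into (high, low) 32-bit halves.

    Assumption: the high half goes into the first reserved field and the
    low half into the second one.
    """
    if not 0 <= service_id < (1 << 64):
        raise ValueError("Service-ID must fit in 64 bits")
    return (service_id >> 32) & MASK32, service_id & MASK32

def merge_service_id(high, low):
    """Rebuild the 64-bit Service-ID from the two 32-bit fields."""
    return ((high & MASK32) << 32) | (low & MASK32)
```

A receiver would call merge_service_id() only when the Service-ID component
mask bit is set in the request.
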
> 
> 3. Supported Policy
> -------------------- 
> 
> The QoS policy supported by this proposal is divided into 4 sub sections:
> 
> * Node Group: a set of HCAs, Routers or Switches that share the same settings. 
> A node group might be a partition defined by the partition manager policy in 
> terms of GUIDs. Future implementations might provide support for NodeDescription 
> based definition of node groups.
> 
> * Fabric Setup: 
> Defines how the SL2VL and VLArb tables should be setup. This policy definition 
> assumes the computation of target behavior should be performed outside of 
> OpenSM.
> 
> * QoS-Levels Definition:
> This section defines the possible sets of parameters for QoS that a client might 
> be mapped to. Each set holds: SL and optionally: Max MTU, Max Rate, Path Bits 
> (in case LMC > 0 is used for QoS) and TClass.
> 
> * Matching Rules:
> A list of rules that match an incoming PathRecord request to a QoS-Level. The 
> rules are processed in order, so the first match is applied. Each rule is 
> built out of a set of match expressions which should all match for the rule to 
> apply. The matching expressions are defined for the following fields:
> ** SRC and DST to lists of node groups
> ** Service-ID to a list of Service-ID or Service-ID ranges
> ** TClass to a list of TClass values or ranges
> 
> XML style syntax is provided for the policy file. However, a strict BNF format 
> (provided in section 8)

What section ?

>  should be used for parsing it.
> 
> <?xml version="1.0" encoding="ISO-8859-1"?>
> <qos-policy>
>  <!-- Port Groups define sets of ports to be used later in the settings -->
>  <port-groups>
>   <!-- using port GUIDs -->
>   <port-group> <name>Storage</name> <use>our SRP storage targets</use>
>    <port-guid>0x1000000000000001</port-guid>
>    <port-guid>0x1000000000000002</port-guid>
>   </port-group>
>   <!-- using names obtained by concatenation of first 2 words of NodeDescription
>     and port number -->
>   <port-group> <name>Virtual Servers</name> <use>node desc and IB port #</use>
>    <port-name>vs1/HCA-1/P1</port-name>
>    <port-name>vs3/HCA-1/P1</port-name>
>    <port-name>vs3/HCA-2/P1</port-name>
>   </port-group>
>   <!-- using partitions defined in the partition policy -->
>   <port-group> <name>Partition 1</name> <use>default settings</use>
>    <partition>Part1</partition> 
>   </port-group>
>   <!-- using node types HCA|ROUTER|SWITCH -->
>   <port-group> <name>Routers</name> <use>all routers</use>
>    <node-type>ROUTER</node-type> 
>   </port-group>  
>  </port-groups>
>  <qos-setup>
>   <!-- define all types of SL2VL tables always have 16 VL entries -->
>   <sl2vl-tables>
>    <!-- scope defines the exact devices and in/out ports the tables apply to
>     if the same port is matching several rules the last one applies -->
>    <sl2vl-scope> <group>Part1</group> <from>*</from> <to>*</to> 
>      <sl2vl-table>0,1,2,3,4,5,6,7,8,9,10,11,12,13,14</sl2vl-table>
>    </sl2vl-scope>
>    <!-- "across" means the port just connected to the given group, 
>      also the link across port 1 is probably supporting only 2 VLs -->
>    <sl2vl-scope> <across>Storage</across> <from>*</from> <to>1</to>
>      <sl2vl-table>0,1,1,1,1,1,1,1,1,1,1,1,1,1,1</sl2vl-table>
>    </sl2vl-scope>
>   </sl2vl-tables>
> 
>   <!-- define all types of VLArb tables. The length of the tables should 
>    match the physically supported tables by their target ports -->
>   <vlarb-tables>
>    <!-- scope defines the exact ports the VLArb tables apply to -->
>    <vlarb-scope> <group>Storage</group> <to>*</to>
>      <!-- VLArb table holds VL and weight pairs -->
>      <vlarb-high>0:255,1:127,2:63,3:31,4:15,5:7,6:3,7:1</vlarb-high>
>      <vlarb-low>8:255,9:127,10:63,11:31,12:15,13:7,14:3</vlarb-low>
>      <vl-high-limit>10</vl-high-limit>
>    </vlarb-scope>
>   </vlarb-tables>
>  </qos-setup>
> 
> <qos-levels>
>   <!-- the first one is just setting SL -->
>   <qos-level> <sn>1</sn> <use>for the lowest priority comm</use>
>     <sl>16</sl>
>   </qos-level>
>   <!-- the second sets SL and TClass -->
>   <qos-level> <sn>2</sn> <use>low latency best bandwidth</use>
>     <sl>0</sl> <tclass>7</tclass>
>   </qos-level>
>   <!-- the whole set: SL, TClass, MTU-Limit, Rate-Limit, Path-Bits  -->
>   <qos-level> <sn>3</sn> <use>just an example</use>
>     <sl>0</sl> <tclass>32</tclass> <mtu_limit>1</mtu_limit> 
>     <rate_limit>1</rate_limit>
>   </qos-level>
>  </qos-levels>
> 
>  <qos_match_rules>
>   <!-- matching by single criteria: tclass (list of values and ranges) -->
>   <qos_match_rule> <sn>1</sn> <use>low latency by tclass 7-9 or 11</use>
>    <tclass>7-9,11</tclass> <match-level>1</match-level>
>   </qos_match_rule>
>   <!-- show matching by destination group AND service-ids -->
>   <qos_match_rule> <sn>2</sn> <use>Storage targets connection</use>
>    <destination>Storage</destination> <service>22,4719</service>
>    <match-level>3</match-level>
>   </qos_match_rule>
>  </qos_match_rules>
> 
> </qos-policy>
> 
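
The match rules in this example use lists of single values and ranges such as
"7-9,11". A minimal sketch of how such a list could be parsed and tested
follows. The helper names are hypothetical, and treating ranges as inclusive is
my reading of the example, not something the RFC states.

```python
def parse_value_list(text):
    """Parse a list like '7-9,11' into inclusive (low, high) ranges."""
    ranges = []
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue
        low, sep, high = item.partition("-")
        lo = int(low, 0)                  # base 0 accepts decimal and 0x... hex
        hi = int(high, 0) if sep else lo  # a single value is a one-element range
        if lo > hi:
            raise ValueError("bad range: " + item)
        ranges.append((lo, hi))
    return ranges

def value_matches(value, ranges):
    """True if value is listed explicitly or covered by one of the ranges."""
    return any(lo <= value <= hi for lo, hi in ranges)
```

With the tclass list above, values 7, 8, 9 and 11 match, while 10 does not.
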
> 
> 4. IPoIB
> ---------
> 
> IPoIB already queries the SA for its broadcast group information. The additional 
> functionality required is for IPoIB to provide the broadcast group SL, MTU, RATE 
> and TClass in every subsequent PathRecord query performed when a new UDAV is 
> needed by IPoIB. 
> We could assign a special Service-ID for IPoIB use but since all communication 
> on the same IPoIB interface shares the same QoS-Level without the ability to 
> differentiate it by target service we can ignore it for simplicity.
> 
> 5. CMA features
> ----------------
> 
> The CMA interface supports Service-ID through the notion of port space as a 
> prefix to the port_num which is part of the sockaddr provided to 
> rdma_resolve_addr(). What is missing is the explicit request for a TClass that 
> should allow the ULP (like SDP) to propagate a specific request for a class of 
> service. A mechanism for providing the TClass is available in the IPv6 address, 
> so we could use that address field. Another option is to implement a special 
> connection options API for CMA.
> 
> Missing functionality by CMA is the usage of the provided TClass and Service-ID 
> in the sent PathRecord. When a response is obtained it is an existing 
> requirement for the CMA to use the PathRecord from the response in setting up 
> the QP address vector.
> 
> 
> 6. SDP
> -------
> 
> SDP uses CMA for building its connections. 
> The Service-ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits 
> holding the remote TCP/IP Port Number to connect to.
> SDP might be provided with SO_PRIORITY socket option. In that case the value 
> provided should be sent to the CMA as the TClass option of that connection. 
> 
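
For reference, the SDP Service-ID layout described above (0x000000000001PPPP)
can be sketched as follows. The function name is hypothetical.

```python
SDP_SERVICE_ID_BASE = 0x0000000000010000  # the fixed 0x...0001 prefix above PPPP

def sdp_service_id(tcp_port):
    """Build the SDP Service-ID for a remote TCP/IP port (the PPPP digits)."""
    if not 0 <= tcp_port <= 0xFFFF:
        raise ValueError("port must fit in 16 bits")
    return SDP_SERVICE_ID_BASE | tcp_port
```

So a connection to port 22 would use Service-ID 0x0000000000010016.
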
> 7. SRP
> -------
> 
> The current SRP implementation uses its own CM callbacks (not CMA). So SRP should 
> fill in the Service-ID in the PathRecord by itself and use that information in 
> setting up the QP. The T10 SRP standard leaves the SRP Service-ID to be defined 
> by the SRP target I/O Controller (though it should also comply with the IBTA 
> Service-ID rules). In any case, the Service-ID is reported by the I/O Controller 
> in the ServiceEntries DM attribute and should be used in the PathRecord if the 
> SA reports its ability to handle QoS PathRecords.
> 
> 8. iSER
> --------
> iSER uses CMA and thus should be very close to SDP. The Service-ID for iSER 
> should be TBD.
> 
> 
> 9. OpenSM features
> -------------------
> The QoS related functionality to be provided by OpenSM can be split into two 
> main parts:
> 
> 3.1. Fabric Setup
> During fabric initialization the SM should parse the policy and apply its 
> settings to the discovered fabric elements. The following actions should be 
> performed:
> * Parsing of policy
> * Node Group identification. Warning should be provided for each node not 
>   specified but found.

What about the other way 'round too (nodes specified but not found) ?

> * SL2VL settings validation should be checked:
>   + A warning will be provided if there are no matching targets for the SL2VL 
>     setting statement. 
>   + An error message will be printed to the log file if an invalid setting is 
>     found. A setting is invalid if it refers to:
>     - Non-existent port numbers of the target devices
>     - Unsupported VLs for the target device. In the latter case the mapping to 
>       non-existent VLs should be replaced by VL15, i.e. packets will be dropped.
> * SL2VL setting is to be performed
> * VL Arbitration table settings should be validated according to the following 
>   rules:
>   + A warning will be provided if there are no matching targets for the setting 
>     statement
>   + An error will be provided if the port number exceeds the target ports
>   + An error will be generated if the table length exceeds device capabilities
>   + A warning will be generated if the table quotes a VL that is not supported 
>     by the target device
> * VL Arbitration tables will be set on the appropriate targets

One needs to be careful about these rules as there are a number of
different "shapes" to these tables.
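
For the SL2VL part, a minimal sketch of the "replace unsupported VLs by VL15"
rule might look like this. The function name and the num_data_vls parameter
(the number of data VLs the target port supports) are hypothetical, and the
check ignores port numbers entirely.

```python
VL15 = 15  # an SL mapped to VL15 means data packets on that SL are dropped

def validate_sl2vl(table, num_data_vls):
    """Return (fixed_table, errors) for one SL2VL table.

    Entries naming a VL the port does not support are replaced by VL15 and
    reported; entries already set to VL15 are kept without an error.
    """
    fixed, errors = [], []
    for sl, vl in enumerate(table):
        if 0 <= vl < num_data_vls:
            fixed.append(vl)
            continue
        if vl != VL15:
            errors.append("SL %d maps to unsupported VL %d" % (sl, vl))
        fixed.append(VL15)
    return fixed, errors
```

On a link supporting only 2 data VLs, a table starting 0,1,2,3 would come back
as 0,1,15,15 with two logged errors.
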

> 3.2. PathRecord query handling:
> OpenSM should be able to enforce the provided policy on client request.
> The overall flow for such requests is: first the request is matched against the 
> defined match rules such that the target QoS-Level definition is found. Given 
> the QoS-Level a path(s) search is performed with the given restrictions imposed 
> by that level. The following two sections describe these steps.
> 
> One issue not standardized by the IBTA is how Service-ID is carried in the 
> PathRecord and MultiPathRecord attributes. There are basically two options:
> a. Replace the SM-Key field by the Service-ID. In that case no component mask
>    bit will be assigned to it, so if the field is zero we should treat it 
>    as if the component mask bit is clear. 
> b. Encode it into spare fields. For PathRecord the first two fields are reserved 
>    and are 64 bits when combined. The first component mask bit maps to the first 
>    reserved field and should be used for Service-ID masking. For the 
>    MultiPathRecord attribute there are no adjacent reserved fields that make up 
>    a 64 bit field, so the reserved field following the packet-lifetime (8 bits) 
>    combined with the reserved field after DGIDCount (56 bits) can make up the 
>    Service-ID. In this case also the first reserved field's component mask bit 
>    should be used as the Service-ID component mask bit.
> 
> 
> 
> 3.2.1. Matching rule search:
> A rule is "matching" a PathRecord request using the following criteria:
> * Matching rules provide values in a list of either single value, or range of    
>   values. A PathRecord field is "matching" the rule field if it is explicitly    
>   noted in the list of values or is one of the values covered by a range 
>   included in the field values list.
> * Only PathRecord fields that have their component mask bit set should be
>   compared. 
> * For a rule to be "matching" a PathRecord request all the rule fields should be 
>   "matching" their PathRecord fields. Hence a PathRecord request that does 
>   not have the component mask bit set for one of the rule defined fields cannot 
>   match that rule.
> * A PathRecord request that has a component mask bit set for a field 
>   that is not defined by the rule can still match the rule. 
> 
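
As I read the criteria above, the matching semantics come down to something
like the sketch below. The representation is an assumption on my part: a
request is a dict holding only the fields whose component mask bit is set, and
a rule is a dict mapping field names to lists of inclusive (low, high) ranges.
Source and destination group membership is left out for brevity.

```python
def rule_matches(rule, request):
    """True if every field defined by the rule is present in the request
    (component mask bit set) and its value falls in one of the rule's ranges.
    Request fields the rule does not mention are ignored."""
    for field, ranges in rule.items():
        if field not in request:
            return False
        value = request[field]
        if not any(lo <= value <= hi for lo, hi in ranges):
            return False
    return True

def find_qos_level(rules, request, default_level):
    """Sequential search: the first matching rule wins, else the default."""
    for rule, level in rules:
        if rule_matches(rule, request):
            return level
    return default_level

# Rules loosely based on the XML example (destination group omitted).
rules = [
    ({"tclass": [(7, 9), (11, 11)]}, 1),
    ({"service_id": [(22, 22), (4719, 4719)]}, 3),
]
```

Note that a request with only Service-ID 22 set skips the first rule (its
TClass bit is clear) and lands on the second one.
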
> The algorithm to be used for searching for a rule match might be as simple as a 
> sequential search through all rules or enhanced for better performance. The 
> semantics of every rule field and its matching PathRecord field are described 
> below:
> * Source: the SGID or SLID should be part of this group
> * Destination: the DGID or DLID should be part of this group
> * Service-ID: check if the requested Service-ID (available in the PathRecord old 
>   SM-Key field) is matching any of this rule Service-IDs
> * TClass: check if the PathRecord TClass field is matching
> 
> 3.2.2 PathRecord response generation:
> The QoS-Level pointed by the first rule that matches the PathRecord request 
> should be used for obtaining the response SL, MTU-Limit, RATE-Limit, Path-Bits 
> and TClass. A default QoS-Level should be used if no rule is matching the query.
> 
> The efficient algorithm for finding paths that meet the QoS-Level criteria is 
> beyond the scope of this RFC and left for the implementer to provide. However 
> the criteria by which the paths match the QoS-Level are described below:
> 
> * SL: The paths found should all use the given SL. To that end the PathRecord 
>   algorithm should traverse the path from source to destination only through 
>   ports that carry a valid VL (not VL15) by the SL2VL map (should consider input 
>   and output ports and SL). 
> * MTU-Limit: The resulting paths MTU should not exceed the given MTU-Limit
> * Rate-Limit: The resulting paths RATE should not exceed the given RATE-Limit 
>   (rate limit is given in units of link BW = Width*Speed according to IBTA  
>   Specification Vol-1 table-205 p-901 l-24).
> * Path-Bits: define the target LID lowest bits (number of bits defined by the 
>   target port PortInfo.LMC field). The path should traverse the LFT using the 
>   target port LID with the path-bits set.
> * TClass: should be returned in the result PathRecord. When routing is going to 
>   be supported by OpenSM we might use this field in selecting the target 
>   router too in a TBD way.
> 
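
A rough sketch of how one candidate path could be checked against these
criteria follows. Everything about the data model is assumed: a path is a
non-empty list of Hop objects, each carrying the SL2VL table that applies to
its (in-port, out-port) pair, and MTU and rate are plain comparable numbers
rather than the IBTA encodings.

```python
from dataclasses import dataclass

VL15 = 15

@dataclass
class Hop:
    sl2vl: list  # 16-entry SL-to-VL map for this hop's in/out port pair
    mtu: int     # assumed plain number (e.g. bytes), not the IBTA enum
    rate: int    # assumed plain number (e.g. Gb/s), not the IBTA enum

def path_response(hops, sl, mtu_limit=None, rate_limit=None):
    """Return the SL, MTU and rate for a path, or None if the SL maps to
    VL15 on any hop (packets would be dropped there)."""
    if any(hop.sl2vl[sl] == VL15 for hop in hops):
        return None
    mtu = min(hop.mtu for hop in hops)
    rate = min(hop.rate for hop in hops)
    if mtu_limit is not None:
        mtu = min(mtu, mtu_limit)
    if rate_limit is not None:
        rate = min(rate, rate_limit)
    return {"sl": sl, "mtu": mtu, "rate": rate}

def lid_with_path_bits(base_lid, lmc, path_bits):
    """Target LID with the QoS-Level path bits in its lowest LMC bits."""
    if not 0 <= path_bits < (1 << lmc):
        raise ValueError("path bits do not fit in LMC bits")
    return base_lid | path_bits
```

The efficient search over many candidate paths is, as the RFC says, left to
the implementer; this only shows the per-path test.
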



