[openib-general] QoS RFC - Resend using a friendly mailer

Eitan Zahavi eitan at mellanox.co.il
Tue May 30 12:34:55 PDT 2006


Hi Hal,

Please see my responses inside

Eitan
> >
> >               RFC: OpenFabrics Enhancements for QoS Support
> >              ===============================================
> >
> > Authors:   Eitan Zahavi <eitan at mellanox.co.il>
> > Date:      May 2006
> > Revision:  0.1
> >
> > Table of contents:
> > 1. Overview
> > 2. Architecture
> > 3. Supported Policy
> > 4. IPoIB functionality
> > 5. CMA functionality
> > 6. SDP functionality
> > 7. SRP functionality
> > 8. iSER functionality
> > 9. OpenSM functionality
> >
> > 1. Overview
> > ------------
> > Quality of Service requirements stem from the realization of I/O consolidation
> > over an IB network: as multiple applications and ULPs share the same fabric, means
> > to control their use of the network resources are becoming a must. The basic
> > need is to differentiate the service levels provided to different traffic flows,
> > so that a policy can be enforced to control each flow's utilization of the
> > fabric resources.
> >
> > The IBTA specification defines several hardware features and management interfaces
> > to support QoS:
> > * Up to 15 Virtual Lanes (VLs) can carry traffic in a non-blocking manner
> > * Arbitration between traffic of different VLs is performed by a weighted
> >   round robin arbiter with 2 priority levels. The arbiter is programmable with
> >   a sequence of (VL, weight) pairs and a maximal number of high priority credits
> >   to be processed before low priority is served
> > * Packets carry a class of service marking in the range 0 to 15 in their
> >   header SL field
> > * Each switch can map an incoming packet by its SL to a particular output
> >   VL based on a programmable table: VL=SL-to-VL-MAP(in-port, out-port, SL)
> > * The Subnet Administrator controls the parameters of each communication flow
> >   by providing them in its response to a Path Record query
> >
> > The IB QoS features provide the means to implement a DiffServ-like architecture.
> > The DiffServ architecture (IETF RFC 2474 and RFC 2475) is widely used today in
> > highly dynamic fabrics.
> 
> Only certain DSCP code point equivalents are provided by IBA.
[EZ] True.
> 
> > This proposal provides the detailed functional definition for the
various
> > software elements that are required to enable a DiffServ like
architecture over
> > the OpenFabrics software stack.
> >
> >
> >
> >
> >
> > 2. Architecture
> > ----------------
> > This proposal splits the QoS functionality between the SM/SA, CMA and the various
> > ULPs. We take the "chronology approach" to describe how the overall system
> > works:
> >
> > 2.1. The network manager (human) provides a set of rules (policy) that defines
> > how the network is configured and how its resources are split among different
> > QoS-Levels. The policy also defines how to decide which QoS-Level each
> > application, ULP or service uses.
> 
> > 2.2. The SM analyzes the provided policy to see if it is realizable
and performs
> > the necessary fabric setup. The SM may continuously monitor the
policy and adapt
> > to changes in it.
> 
> Do you mean monitor the policy or the fabric here ?
[EZ] I mean monitor the policy such that changes in it are enforced.
> 
> >  Part of this policy defines the default QoS-Level of each
> > partition. The SA is being enhanced to match the requested Source,
Destination,
> > TClass, Service-ID
> 
> Service ID does not apply to many ULPs. Also, how is it known what
> ULP/application a particular service ID refers to (other than perhaps
> some well known ones) ?
[EZ] True - only well known Service-IDs can have a predefined policy
attached to them.
But I disagree that services are unknown - if they are unknown,
how are they being found by the clients?
> 
> >  (and optionally SL and priority) against the policy. So
> > clients (ULPs, programs) can obtain a policy enforced QoS. The SM is
also
> > enhanced to support setting up partitions with appropriate IPoIB
broadcast
> > group. This broadcast group carries its QoS attributes: TClass, SL,
MTU and
> > RATE.
> >
> > 2.3. IPoIB is being setup. IPoIB uses the SL, MTU and RATE available
on the
> > multicast group which forms the broadcast group of this partition.
> >
> > 2.4. MPI, which provides non IB based connection management, should be configured
> > to run using hard coded SLs. It uses these SLs in every QP being opened.
> >
> > 2.5. ULPs that use CM interface (like SRP) should have their own
pre-assigned
> > Service-ID and use it while obtaining PathRecord for establishing
their
> > connections. The SA receiving the PathRecord should match it against
the policy
> > and return the appropriate PathRecord including SL, MTU, RATE and
TClass.
> >
> > 2.6. ULPs and programs using CMA to establish RC connection should
provide the
> > CMA the target IP and Service-ID. Some of the ULPs might also
provide TClass
> > (E.g. for SDP sockets that are provided the TOS socket option). The
CMA should
> > then use the provided Service-ID and optional TClass and pass them
in the
> > PathRecord request. The resulting PathRecord should be used for
configuring the
> > connection QP.
> >
> > PathRecord and MultiPathRecord enhancement for QoS:
> > As mentioned above the PathRecord and MultiPathRecord attributes
should be
> > enhanced to carry the Service-ID which is a 64bit value. Given the
existing
> > definition for these attributes we propose to use the following
fields for
> > Service-ID:
> > * For PathRecord: use the first 2 reserved fields, which are 32 bits each
> >   (component masks 0x1 and 0x2). Component mask 0x1 should be used to refer to the
> >   merged Service-ID field
> > * For MultiPathRecord: use 2 reserved fields:
> >   1. after the packet life (8 bits) which is component mask bit 0x10000 (17)
> >   2. the field before SDGID1 (56 bits) which is component mask bit 0x200000 (22)
> 
> This is not possible with the existing approved 1.2 erratum changes.
[EZ] Ooops I was using 1.2 spec. Can you elaborate on the field I
missed? Can we find a replacement?
> 
> >   Once merged they should be selected using component mask bit
0x10000 (17)
> > A new capability bit should describe the SM QoS support in the SA
class port
> > info. This approach provides an easy migration path for existing
access layer
> > and ULPs by not introducing a new attribute.
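
The field arithmetic proposed above can be sketched as follows. This is only
an illustration of the bit splitting, not an agreed wire format: the function
names are hypothetical, and placing the most significant bits in the first
reserved field is an assumption the RFC does not state. As noted in the reply
above, the reserved fields themselves may not be available after the 1.2
erratum.

```python
MASK32 = 0xFFFFFFFF
MASK56 = (1 << 56) - 1


def split_for_path_record(service_id):
    """Split a 64-bit Service-ID into two 32-bit reserved fields.

    Assumption: the first reserved field carries the upper 32 bits.
    """
    if not 0 <= service_id < (1 << 64):
        raise ValueError("Service-ID must fit in 64 bits")
    return (service_id >> 32) & MASK32, service_id & MASK32


def merge_from_path_record(first, second):
    """Rebuild the 64-bit Service-ID from the two 32-bit fields."""
    return ((first & MASK32) << 32) | (second & MASK32)


def split_for_multi_path_record(service_id):
    """Split a 64-bit Service-ID into an 8-bit and a 56-bit reserved field.

    Assumption: the 8-bit field carries the upper 8 bits.
    """
    if not 0 <= service_id < (1 << 64):
        raise ValueError("Service-ID must fit in 64 bits")
    return (service_id >> 56) & 0xFF, service_id & MASK56


def merge_from_multi_path_record(top8, low56):
    """Rebuild the 64-bit Service-ID from the 8-bit and 56-bit fields."""
    return ((top8 & 0xFF) << 56) | (low56 & MASK56)
```

Either split is lossless, so a receiver that knows the layout can always
recover the original 64-bit value.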
> >
> >
> > 3. Supported Policy
> > --------------------
> >
> > The QoS policy supported by this proposal is divided into 4 sub
sections:
> >
> > * Node Group: a set of HCAs, Routers or Switches that share the same
settings.
> > A node group might be a partition defined by the partition manager policy in
> > terms of GUIDs. Future implementations might provide support for NodeDescription
> > based definition of node groups.
> >
> > * Fabric Setup:
> > Defines how the SL2VL and VLArb tables should be setup. This policy
definition
> > assumes the computation of target behavior should be performed
outside of
> > OpenSM.
> >
> > * QoS-Levels Definition:
> > This section defines the possible sets of parameters for QoS that a
client might
> > be mapped to. Each set holds: SL and optionally: Max MTU, Max Rate,
Path Bits
> > (in case LMC > 0 is used for QoS) and TClass.
> >
> > * Matching Rules:
> > A list of rules that match an incoming PathRecord request to a QoS-Level. The
> > rules are processed in order, so the first matching rule is applied. Each rule is
> > built out of a set of match expressions which should all match for the rule to
> > apply. The matching expressions are defined for the following fields:
> > ** SRC and DST to lists of node groups
> > ** Service-ID to a list of Service-ID or Service-ID ranges
> > ** TClass to a list of TClass values or ranges
> >
> > XML style syntax is provided for the policy file. However, a strict
BNF format
> > (provided in section 8)
> 
> What section ?
[EZ] Sorry I planned to add it and did not make it for this mail. Please
ignore this. 
I will provide the BNF once we make some progress.
> 
> >  should be used for parsing it.
> >
> > <?xml version="1.0" encoding="ISO-8859-1"?>
> > <qos-policy>
> >  <!-- Port Groups define sets of ports to be used later in the
settings -->
> >  <port-groups>
> >   <!-- using port GUIDs -->
> >   <port-group> <name>Storage</name> <use>our SRP storage
targets</use>
> >    <port-guid>0x1000000000000001</port-guid>
> >    <port-guid>0x1000000000000002</port-guid>
> >   </port-group>
> >   <!-- using names obtained by concatenation of first 2 words of
NodeDescription
> >     and port number -->
> >   <port-group> <name>Virtual Servers</name> <use>node desc and IB
port #</use>
> >    <port-name>vs1/HCA-1/P1</port-name>
> >    <port-name>vs3/HCA-1/P1</port-name>
> >    <port-name>vs3/HCA-2/P1</port-name>
> >   </port-group>
> >   <!-- using partitions defined in the partition policy -->
> >   <port-group> <name>Partition 1</name> <use>default settings</use>
> >    <partition>Part1</partition>
> >   </port-group>
> >   <!-- using node types HCA|ROUTER|SWITCH -->
> >   <port-group> <name>Routers</name> <use>all routers</use>
> >    <node-type>ROUTER</node-type>
> >   </port-group>
> >  </port-groups>
> >  <qos-setup>
> >   <!-- define all types of SL2VL tables always have 16 VL entries
-->
> >   <sl2vl-tables>
> >    <!-- scope defines the exact devices and in/out ports the tables
apply to
> >     if the same port is matching several rules the last one applies
-->
> >    <sl2vl-scope> <group>Part1</group> <from>*</from> <to>*</to>
> >      <sl2vl-table>0,1,2,3,4,5,6,7,8,9,10,11,12,13,14</sl2vl-table>
> >    </sl2vl-scope>
> >    <!-- "across" means the port just connected to the given group,
> >      also the link across port 1 is probably supporting only 2 VLs
-->
> >    <sl2vl-scope> <across>Storage</across> <from>*</from> <to>1</to>
> >      <sl2vl-table>0,1,1,1,1,1,1,1,1,1,1,1,1,1,1</sl2vl-table>
> >    </sl2vl-scope>
> >   </sl2vl-tables>
> >
> >   <!-- define all types of VLArb tables. The length of the tables
should
> >    match the physically supported tables by their target ports -->
> >   <vlarb-tables>
> >    <!-- scope defines the exact ports the VLArb tables apply to -->
> >    <vlarb-scope> <group>Storage</group> <to>*</to>
> >      <!-- VLArb table holds VL and weight pairs -->
> >      <vlarb-high>0:255,1:127,2:63,3:31,4:15,5:7,6:3,7:1</vlarb-high>
> >      <vlarb-low>8:255,9:127,10:63,11:31,12:15,13:7,14:3</vlarb-low>
> >      <vl-high-limit>10</vl-high-limit>
> >    </vlarb-scope>
> >   </vlarb-tables>
> >  </qos-setup>
> >
> > <qos-levels>
> >   <!-- the first one is just setting SL -->
> >   <qos-level> <sn>1</sn> <use>for the lowest priority comm</use>
> >     <sl>15</sl>
> >   </qos-level>
> >   <!-- the second sets SL and TClass -->
> >   <qos-level> <sn>2</sn> <use>low latency best bandwidth</use>
> >     <sl>0</sl> <tclass>7</tclass>
> >   </qos-level>
> >   <!-- the whole set: SL, TClass, MTU-Limit, Rate-Limit, Path-Bits
-->
> >   <qos-level> <sn>3</sn> <use>just an example</use>
> >     <sl>0</sl> <tclass>32</tclass> <mtu_limit>1</mtu_limit>
> >     <rate_limit>1</rate_limit>
> >   </qos-level>
> >  </qos-levels>
> >
> >  <qos_match_rules>
> >   <!-- matching by single criteria: tclass (list of values and
ranges) -->
> >   <qos_match_rule> <sn>1</sn> <use>low latency by tclass 7-9 or 11</use>
> >    <tclass>7-9,11</tclass> <match-level>1</match-level>
> >   </qos_match_rule>
> >   <!-- show matching by destination group AND service-ids -->
> >   <qos_match_rule> <sn>2</sn> <use>Storage targets connection</use>
> >    <destination>Storage</destination> <service>22,4719</service>
> >    <match-level>3</match-level>
> >   </qos_match_rule>
> >  </qos_match_rules>
> >
> > </qos-policy>
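
As a rough illustration of how such a policy file might be consumed, the
sketch below loads the qos-level entries with Python's standard XML parser.
It is not the strict BNF-based parser the RFC calls for. The function name
and the dictionary layout are hypothetical, and the embedded fragment is a
trimmed copy of the example above with its closing tags corrected.

```python
import xml.etree.ElementTree as ET

# Trimmed fragment of the example policy (closing tags corrected).
POLICY = """<qos-policy>
 <qos-levels>
  <qos-level> <sn>2</sn> <use>low latency best bandwidth</use>
    <sl>0</sl> <tclass>7</tclass>
  </qos-level>
  <qos-level> <sn>3</sn> <use>just an example</use>
    <sl>0</sl> <tclass>32</tclass> <mtu_limit>1</mtu_limit>
    <rate_limit>1</rate_limit>
  </qos-level>
 </qos-levels>
</qos-policy>"""


def load_qos_levels(xml_text):
    """Return {sn: {"use": str, "sl": int, ...}} for every qos-level."""
    root = ET.fromstring(xml_text)
    levels = {}
    for node in root.iter("qos-level"):
        # Each child element is a simple <tag>value</tag> pair.
        fields = {child.tag: child.text.strip() for child in node}
        sn = int(fields.pop("sn"))
        use = fields.pop("use", "")
        levels[sn] = {"use": use}
        # All remaining fields in the example are plain integers.
        levels[sn].update({tag: int(value) for tag, value in fields.items()})
    return levels


levels = load_qos_levels(POLICY)
```

Optional fields that a level does not set are simply absent from its
dictionary, which mirrors the "SL and optionally ..." wording of section 3.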
> >
> >
> > 4. IPoIB
> > ---------
> >
> > IPoIB already queries the SA for its broadcast group information. The additional
> > functionality required is for IPoIB to provide the broadcast group SL, MTU, RATE
> > and TClass in every subsequent PathRecord query performed when a new UDAV is
> > needed by IPoIB.
> > We could assign a special Service-ID for IPoIB use, but since all communication
> > on the same IPoIB interface shares the same QoS-Level without the ability to
> > differentiate it by target service, we can ignore it for simplicity.
> >
> > 5. CMA features
> > ----------------
> >
> > The CMA interface supports Service-ID through the notion of port space as a
> > prefix to the port_num which is part of the sockaddr provided to
> > rdma_resolve_addr(). What is missing is the explicit request for a TClass that
> > should allow the ULP (like SDP) to propagate a specific request for a class of
> > service. A mechanism for providing the TClass is available in the IPv6 address,
> > so we could use that address field. Another option is to implement a special
> > connection options API for CMA.
> >
> > What CMA currently lacks is the use of the provided TClass and Service-ID
> > in the PathRecord query it sends. When a response is obtained it is an existing
> > requirement for the CMA to use the PathRecord from the response in setting up
> > the QP address vector.
> >
> >
> > 6. SDP
> > -------
> >
> > SDP uses CMA for building its connections.
> > The Service-ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex
digits
> > holding the remote TCP/IP Port Number to connect to.
> > SDP might be provided with SO_PRIORITY socket option. In that case
the value
> > provided should be sent to the CMA as the TClass option of that
connection.
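
For illustration, the SDP Service-ID layout described above amounts to the
following arithmetic. The function and constant names are hypothetical; only
the 0x000000000001PPPP layout comes from the text.

```python
# Fixed prefix of the SDP Service-ID: 0x000000000001 followed by 16 port bits.
SDP_SERVICE_ID_BASE = 0x0000000000010000


def sdp_service_id(tcp_port):
    """Build the SDP Service-ID for a remote TCP/IP port number."""
    if not 0 <= tcp_port <= 0xFFFF:
        raise ValueError("port must fit in 4 hex digits")
    return SDP_SERVICE_ID_BASE | tcp_port


# Port 22 (0x0016) becomes 0x0000000000010016.
example = format(sdp_service_id(22), "#018x")
```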
> >
> > 7. SRP
> > -------
> >
> > The current SRP implementation uses its own CM callbacks (not CMA). So SRP should
> > fill in the Service-ID in the PathRecord by itself and use that information in
> > setting up the QP. The T10 SRP standard leaves the SRP Service-ID to be defined
> > by the SRP target I/O Controller (but it should also comply with the IBTA
> > Service-ID rules). Anyway, the Service-ID is reported by the I/O Controller in the
> > ServiceEntries DM attribute and should be used in the PathRecord if the SA
> > reports its ability to handle QoS PathRecords.
> >
> > 8. iSER
> > --------
> > iSER uses CMA and thus should be very close to SDP. The Service-ID for iSER
> > is TBD.
> >
> >
> > 9. OpenSM features
> > -------------------
> > The QoS related functionality to be provided by OpenSM can be split
into two
> > main parts:
> >
> > 9.1. Fabric Setup
> > During fabric initialization the SM should parse the policy and
apply its
> > settings to the discovered fabric elements. The following actions
should be
> > performed:
> > * Parsing of policy
> > * Node Group identification. Warning should be provided for each
node not
> >   specified but found.
> 
> What about the other way 'round too (nodes specified but not found) ?
[EZ] Yep. Will require some warning too.
> 
> > * SL2VL settings should be validated:
> >   + A warning will be provided if there are no matching targets for the SL2VL
> >     setting statement.
> >   + An error message will be printed to the log file if an invalid setting is
> >     found. A setting is invalid if it refers to:
> >     - Non existing port numbers of the target devices
> >     - Unsupported VLs for the target device. In the latter case the map to non
> >       existing VLs should be replaced by VL15, i.e. packets will be dropped.
> > * SL2VL setting is to be performed
> > * VL Arbitration table settings should be validated according to the following
> >   rules:
> >   + A warning will be provided if there are no matching targets for the setting
> >     statement
> >   + An error will be provided if the port number exceeds the target ports
> >   + An error will be generated if the table length exceeds device capabilities
> >   + A warning will be generated if the table quotes a VL that is not supported
> >     by the target device
> > * VL Arbitration tables will be set on the appropriate targets
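
A minimal sketch of two of the validation rules above, under stated
assumptions: a device supporting N data VLs is taken to support VL0..N-1, the
function names and message texts are hypothetical, and the port number and
"no matching targets" checks are left out.

```python
VL15 = 15


def sanitize_sl2vl(table, num_data_vls):
    """Replace unsupported VLs in an SL2VL table by VL15.

    Returns (fixed_table, errors). Entries already set to VL15 are kept
    as they are, since they already mean "drop".
    """
    fixed, errors = [], []
    for sl, vl in enumerate(table):
        if vl != VL15 and vl >= num_data_vls:
            errors.append("SL %d maps to unsupported VL %d, using VL15" % (sl, vl))
            fixed.append(VL15)
        else:
            fixed.append(vl)
    return fixed, errors


def check_vlarb(entries, max_entries, num_data_vls):
    """Validate a list of (vl, weight) pairs against device capabilities.

    Returns (errors, warnings): a table that is too long is an error, an
    entry quoting an unsupported VL is only a warning.
    """
    errors, warnings = [], []
    if len(entries) > max_entries:
        errors.append("table has %d entries, device supports %d"
                      % (len(entries), max_entries))
    for vl, _weight in entries:
        if vl >= num_data_vls:
            warnings.append("VL %d is not supported by the target" % vl)
    return errors, warnings


# A link supporting only 2 data VLs (VL0 and VL1).
fixed, errs = sanitize_sl2vl([0, 1, 2, 3], num_data_vls=2)
```

In this example SL 2 and SL 3 end up mapped to VL15, so their packets would be
dropped on that link, and two error messages are reported.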
> 
> One needs to be careful about these rules as there are a number of
> different "shapes" to these tables.
[EZ] Not sure what you mean by shape. IBTA defined all VLArb with same
format?
> 
> > 9.2. PathRecord query handling:
> > OpenSM should be able to enforce the provided policy on client requests.
> > The overall flow for such requests is: first the request is matched against the
> > defined match rules so that the target QoS-Level definition is found. Given
> > the QoS-Level, a search for one or more paths is performed with the restrictions
> > imposed by that level. The following two sections describe these steps.
> >
> > One issue not standardized by the IBTA is how Service-ID is carried
in the
> > PathRecord and MultiPathRecord attributes. There are basically two
options:
> > a. Replace the SM-Key field by the Service-ID. In that case no component mask
> >    bit will be assigned to it, so a zero field should be treated
> >    as if the component mask bit is clear.
> > b. Encode it into spare fields. For PathRecord the first two fields are reserved
> >    and are 64 bits when combined. The first component mask bit maps to the first
> >    reserved field and should be used for Service-ID masking. For the
> >    MultiPathRecord attribute there are no adjacent reserved fields that make
> >    a 64 bit field. So the reserved field following the packet-lifetime (8 bits)
> >    combined with the reserved field before DGIDCount (56 bits) can make the
> >    Service-ID. In this case also the first reserved field component mask bit
> >    should be used as the Service-ID component mask bit.
> >
> >
> >
> > 9.2.1. Matching rule search:
> > A rule is "matching" a PathRecord request using the following criteria:
> > * Matching rules provide values in a list of either single values or ranges of
> >   values. A PathRecord field is "matching" the rule field if it is explicitly
> >   noted in the list of values or is one of the values covered by a range
> >   included in the field values list.
> > * Only PathRecord fields that have their component mask bit set should be
> >   compared.
> > * For a rule to be "matching" a PathRecord request all the rule fields should be
> >   "matching" their PathRecord fields. Thus a PathRecord request that does
> >   not have the component mask bit set for one of the rule defined fields
> >   cannot match that rule.
> > * A PathRecord request that has a component mask bit set for a field
> >   that is not defined by the rule can still match the rule.
> >
> > The algorithm to be used for searching for a rule match might be as
simple as a
> > sequential search through all rules or enhanced for better
performance. The
> > semantics of every rule field and its matching PathRecord field are
described
> > below:
> > * Source: the SGID or SLID should be part of this group
> > * Destination: the DGID or DLID should be part of this group
> > * Service-ID: check if the requested Service-ID (available in the
PathRecord old
> >   SM-Key field) is matching any of this rule Service-IDs
> > * TClass: check if the PathRecord TClass field is matching
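
The matching semantics above can be sketched as a simple sequential search.
This is a minimal sketch, not OpenSM code: the data layout and function names
are hypothetical, the request is assumed to hold only the fields whose
component mask bit is set, and node group matching for Source and Destination
is left out to keep the example short.

```python
def parse_value_list(text):
    """Parse a list such as "7-9,11" into inclusive (low, high) ranges."""
    ranges = []
    for item in text.split(","):
        low, sep, high = item.strip().partition("-")
        low_value = int(low, 0)          # base 0 also accepts 0x... values
        high_value = int(high, 0) if sep else low_value
        ranges.append((low_value, high_value))
    return ranges


def field_matches(value, ranges):
    """True if value is listed explicitly or covered by one of the ranges."""
    return any(low <= value <= high for low, high in ranges)


def find_qos_level(rules, request, default_level):
    """Return the QoS-Level of the first rule that matches the request.

    rules:   ordered list of {"fields": {name: ranges}, "level": sn}
    request: {name: value} for fields whose component mask bit is set
    """
    for rule in rules:
        # Every field the rule defines must be present in the request and
        # match; extra request fields the rule does not define are ignored.
        if all(name in request and field_matches(request[name], ranges)
               for name, ranges in rule["fields"].items()):
            return rule["level"]
    return default_level


rules = [
    {"fields": {"tclass": parse_value_list("7-9,11")}, "level": 1},
    {"fields": {"service": parse_value_list("22,4719")}, "level": 3},
]
```

With these two rules, a request carrying only TClass 8 resolves to level 1,
while a request with TClass 10 and Service-ID 22 falls through to the second
rule and resolves to level 3.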
> >
> > 9.2.2. PathRecord response generation:
> > The QoS-Level pointed to by the first rule that matches the PathRecord request
> > should be used for obtaining the response SL, MTU-Limit, RATE-Limit, Path-Bits
> > and TClass. A default QoS-Level should be used if no rule matches the query.
> >
> > An efficient algorithm for finding paths that meet the QoS-Level criteria is
> > beyond the scope of this RFC and is left for the implementer to provide. However,
> > the criteria by which the paths match the QoS-Level are described below:
> >
> > * SL: The paths found should all use the given SL. To that end the PathRecord
> >   algorithm should traverse the path from source to destination only through
> >   ports that carry a valid VL (not VL15) by the SL2VL map (it should consider
> >   input and output ports and SL).
> > * MTU-Limit: The resulting path MTU should not exceed the given MTU-Limit
> > * Rate-Limit: The resulting path RATE should not exceed the given RATE-Limit
> >   (rate limit is given in units of link BW = Width*Speed according to IBTA
> >   Specification Vol-1 table-205 p-901 l-24).
> > * Path-Bits: define the lowest bits of the target LID (the number of bits is
> >   defined by the target port PortInfo.LMC field). The path should traverse the
> >   LFT using the target port LID with the path-bits set.
> > * TClass: should be returned in the result PathRecord. When routing is
> >   supported by OpenSM we might also use this field in selecting the target
> >   router in a TBD way.
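
Two of the criteria above lend themselves to a short sketch: the SL check
along a candidate path, and the Path-Bits handling. The names and the data
layout are hypothetical, and the code assumes that a port's base LID has its
lowest LMC bits clear. The MTU and rate limits are left out.

```python
VL15 = 15


def path_carries_sl(hops, sl, sl2vl):
    """True if every hop maps the SL to a valid VL (anything but VL15).

    hops:  list of (switch, in_port, out_port) tuples along the path
    sl2vl: {(switch, in_port, out_port): list of VLs indexed by SL}
    """
    return all(sl2vl[hop][sl] != VL15 for hop in hops)


def target_lid(base_lid, lmc, path_bits):
    """Combine the target port base LID with the QoS-Level path bits."""
    mask = (1 << lmc) - 1
    if base_lid & mask:
        raise ValueError("base LID must have its lowest LMC bits clear")
    if not 0 <= path_bits <= mask:
        raise ValueError("path bits do not fit in LMC bits")
    return base_lid | path_bits


# Two hops: the second one drops SL 2 by mapping it to VL15.
sl2vl = {
    ("sw1", 1, 2): [0, 1, 2, 3],
    ("sw2", 3, 1): [0, 1, 15, 15],
}
hops = [("sw1", 1, 2), ("sw2", 3, 1)]
```

Here the path is usable for SL 1 but not for SL 2, and with LMC = 2 a base LID
of 0x10 with path bits 3 yields the target LID 0x13.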
> >


