[openib-general] IPoIB -- connected mode update
Vivek Kashyap
kashyapv at us.ibm.com
Wed Aug 3 23:50:59 PDT 2005
Attached is an updated draft (I will post it to Internet-Drafts after the
current IETF meeting ends) for IPoIB connected mode, based on the discussions
on the ipoib WG list, openib (IB on Linux), and other communications. Two
threads that saw good discussion are given below. I believe the attached
updated draft captures all the discussions. Please comment.
http://openib.org/pipermail/openib-general/2005-May/006751.html
http://www1.ietf.org/mail-archive/web/ipoverib/current/msg01212.html
thanks,
Vivek
--------------------------------
IP over InfiniBand: Connected Mode
Abstract
This document specifies a method for transmitting IPv4/IPv6
packets and address resolution over the connected modes of
InfiniBand.
Table of Contents
1.0 Introduction
2.0 IPoIB-connected mode
2.1 Multicasting
2.2 Outline of Address Resolution
2.3 Outline of Connection Setup
3.0 Address Resolution
3.1 Link-layer Address
3.2 IB Connection Setup
3.3 Service-ID
4.0 Frame Format
5.0 Maximum Transmission Unit
5.1 Per-Connection MTU
6.0 IPoIB-CM Considerations
6.1 A Cautionary Note on IPoIB-RC
7.0 Security Considerations
8.0 IANA Considerations
9.0 References
1.0 Introduction
The InfiniBand specification [IB_ARCH] can be found at
www.infinibandta.org. The document [IPoIB_ARCH] provides a
short overview of InfiniBand architecture along with
considerations for specifying IP over InfiniBand networks.
The InfiniBand architecture (IBA) defines multiple transport
modes. Of these, the unreliable datagram (UD) transport
method best matches the needs of IP. IP over InfiniBand (IPoIB)
over UD is described in [IPoIB_UD]. This document describes
IP transmission over the connected modes of IBA.
IBA defines two connected modes:
1. Reliable Connected (RC)
2. Unreliable Connected (UC)
As is evident from the nomenclature, the two modes differ mainly
in providing reliability of data delivery across the connection.
This document applies equally to both the connected modes.
IPoIB over these two modes is referred to as IPoIB-CM (connected
mode) in this document. For clarity IPoIB over the unreliable
datagram mode, as described in [IPoIB_UD] is referred to as
IPoIB-UD.
IBA requires that all Host Channel Adapters (HCAs) support the
reliable and unreliable connected modes [IB_ARCH]. It is
optional for Target Channel Adapters (TCAs) to support the
connected modes.
The connected modes offer link MTUs of up to 2^31 octets in
length. Thus the use of connected modes can offer significant
benefits by supporting reasonably large MTUs. The datagram modes
of InfiniBand Architecture (IBA) are limited to 4096 octets.
Reliability is also enhanced if the underlying feature of
"automatic path migration" supported by the connected modes is
utilized.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described
in RFC 2119.
2.0 IPoIB-connected mode
Every IPoIB implementation MUST support IPoIB-UD. The IPoIB-CM
support is OPTIONAL.
This document refers extensively to [IPoIB_UD] and extends the
IPoIB description given there to IPoIB-CM. Therefore, only the
additional requirements or enhancements needed to enable
IPoIB-CM are described.
The IP encapsulation, default MTU, link layer address format and
the IPv6 stateless autoconfiguration mechanism apply to IPoIB-CM
exactly as described in [IPoIB_UD].
2.1 Multicasting
The connected modes of IBA define a non-broadcast, multiple
access network. The connected modes of IBA do not support
multicasting though every node can communicate with every other
node if desired.
This requires that multicasting be emulated in some form by the
network. However, in the case of an InfiniBand network, instead
of an emulation, an unreliable datagram (UD) queue pair (QP)
can be used to support multicasting while the connected mode QP
is used for unicast traffic. Since every IPoIB implementation
is required to support the UD mode, every implementation
supporting IPoIB-CM will be able to utilize the coexisting
IPoIB-UD QP for all broadcast/multicast communications.
Multicast mapping, transmission and reception of multicast
packets and multicast routing MUST use the IPoIB-UD QP
associated with the IPoIB-CM interface.
2.2 Outline of Address Resolution
Every IPoIB-CM interface MUST have two QPs associated with it:
1) A connected mode QP
2) An unreliable datagram mode QP
[IPoIB_UD] proposes that the address resolution query is
multicast over an IB multicast address that is joined by every
member of the IPoIB subnet. This IB multicast address is
referred to as the "broadcast-GID" [IPoIB_UD]. The "broadcast-GID"
is "FullMember" joined by every IPoIB-UD implementation on
the associated QP [IPoIB_UD].
A broadcast-GID is formed with the knowledge of the scope bits,
the IP version, and the partition key (P_Key) associated with the
subnet. Thus these three parameters must be known to the node
before an IPoIB interface can be brought up. The exact format
and rules to set up the broadcast-GID are defined in [IPoIB_UD].
The response to the query is received on the IPoIB-UD QP
[IPoIB_UD].
2.3 Outline of Connection Setup
Once the link address of the remote node is known, an IB
connection must be set up between the nodes before any IP
communication may occur.
To make a connection, the sender must know the service-ID to use
in the request to make a connection [IB_ARCH]. It must also
supply the "connection mode" queue pair to the remote node. The
peer replies with its queue pair. Each IB connection is peer to
peer and uses one connected mode QP at each end.
Though address resolution occurs at the level of an individual
IP address, the connection between the nodes is at the IB layer.
Therefore, not every address resolution implies a new
connection between the peers.
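The relationship between address resolution and IB connections can be
sketched as a small lookup table. This is only an illustration: the
class name, the use of the peer's link-layer address as the key, and
the `connect` callable are assumptions, not part of this draft. It
models the simple case where all IP addresses that resolve to the same
peer share one connection.

```python
class ConnectionTable:
    """Map resolved IP addresses onto shared per-peer IB connections."""

    def __init__(self, connect):
        # Hypothetical callable: link-layer address -> connection handle.
        self._connect = connect
        self._neighbours = {}  # IP address -> peer link-layer address
        self._by_peer = {}     # peer link-layer address -> connection

    def resolved(self, ip, link_addr):
        # Record the outcome of one address resolution.
        self._neighbours[ip] = link_addr

    def connection_for(self, ip):
        # Reuse an existing connection to the same peer if there is one.
        link_addr = self._neighbours[ip]
        conn = self._by_peer.get(link_addr)
        if conn is None:
            conn = self._connect(link_addr)
            self._by_peer[link_addr] = conn
        return conn
```

Resolving a second IP address that maps to an already connected peer
returns the existing handle instead of triggering a new connection
request.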
3.0 Address Resolution
Address resolution queries are sent out on the "broadcast-GID"
over the IPoIB-UD QP associated with the IPoIB-CM interface. A
unicast reply is received on the UD QP.
3.1 Link-layer Address
IPoIB encapsulation [IPoIB_UD] describes the link-layer address
as follows:
<1 octet reserved>:QPN:GID
This document extends the link-layer address as follows:
<Flags>:QPN:GID
Flags:
This is a single octet field. The bits indicate the
connected modes supported by the interface.
Bit 0 specifies the support for the "reliable connected"
(RC) mode. Bit 1 indicates the support for the
"unreliable connected" (UC) mode. All other bits in the
octet are reserved and MUST be set to 0 on transmits and
ignored on receives. The format of the flags is:
+--+--+--+--+--+--+--+--+
|RC|UC| 0| 0| 0| 0| 0| 0|
+--+--+--+--+--+--+--+--+
Both the RC and UC bits MAY be set at the same time if the
interface supports both modes. Since the IPoIB-UD
mode is always supported, there is no flag to indicate
IPoIB-UD support.
If IPoIB-CM is not supported i.e. if the implementation
only supports IPoIB-UD, then the implementation MUST
ignore the <Flags> on reception. It MUST set the <Flags>
octet to all zeroes on transmission as specified in
[IPoIB_UD].
QPN:
The queue-pair number (QPN) on which the unicast address
resolution reply will be received. This allows the
IPoIB-UD address resolution code and method to be used
for IPoIB-CM address resolution.
The QPN also serves another purpose. It is used to form
the Service-ID that is used to setup the IB connection.
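As an illustration, the extended link-layer address could be packed
and parsed as sketched below. The sketch assumes the 20-octet layout
of [IPoIB_UD] (one octet, a 24-bit QPN, a 16-octet GID) and assumes
that "bit 0" is the most significant bit of the flags octet, matching
the left-to-right order of the diagram above. The function names are
hypothetical.

```python
# Assumption: bit 0 is the most significant bit of the flags octet.
FLAG_RC = 0x80  # bit 0: reliable connected mode supported
FLAG_UC = 0x40  # bit 1: unreliable connected mode supported


def pack_link_address(qpn, gid, rc=False, uc=False):
    """Build the 20-octet <Flags>:QPN:GID link-layer address."""
    if not 0 <= qpn < (1 << 24):
        raise ValueError("QPN must fit in 24 bits")
    if len(gid) != 16:
        raise ValueError("GID must be 16 octets")
    # Reserved flag bits are left as zero on transmit.
    flags = (FLAG_RC if rc else 0) | (FLAG_UC if uc else 0)
    return bytes([flags]) + qpn.to_bytes(3, "big") + gid


def unpack_link_address(addr):
    """Return (rc, uc, qpn, gid); reserved flag bits are ignored."""
    if len(addr) != 20:
        raise ValueError("link-layer address must be 20 octets")
    flags = addr[0]
    qpn = int.from_bytes(addr[1:4], "big")
    return bool(flags & FLAG_RC), bool(flags & FLAG_UC), qpn, addr[4:]
```

An IPoIB-UD-only implementation would call `pack_link_address` with
both flags left at their defaults, which yields the all-zero octet
required by [IPoIB_UD].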
On receiving the multicast/broadcast address resolution request
the receiver replies with its own link-address, including the
associated UD QPN and the appropriate flags.
The receiver's reply is unicast back to the sender after the
receiver has, as in the case of IPoIB-UD, resolved the GID to
the LID and determined other required parameters [IPoIB_UD].
Once the address resolution is completed the underlying IB
connection on the supported connection modes can be set up. An
implementation is NOT REQUIRED to set up a connection merely
because the peer indicates the capability. The decision to make
such a connection is left to the implementation.
3.2 IB Connection Setup
The IB reliable/unreliable mode connection may be set up by
either peer, though it is more likely that the one that
initiated the address resolution phase, probably as a result of
the need to send IP data, will initiate the connection setup.
IBA allows passive-active and active-active connection setup
[IB_ARCH].
To set up a connection, IB Management Datagrams (MADs) are
directed to the peer's communication manager (CM). The
connection request always contains a Service-ID for the peer to
associate the request with the appropriate entity. If the
request is accepted the peer returns the relevant connected mode
QPN in the response MAD. The format of the CM connection
messages and the IB connection setup process is described in
[IB_ARCH].
The CM messages include, among other parameters, the Service-ID,
Local QPN, and the payload size to use over the connection.
Note:
The IB connection is set up using the Service-ID defined in
section 3.3. The node MUST keep a record of IB connections it is
participating in. The node MAY attempt another connection
to the remote peer using the same Service-ID as used for
an existing IB connection. Similarly, the receiver of such
a connection MAY drop the request with a suitable error
indication in the CM response. The decision to accept or
initiate multiple connections from or to an IPoIB
interface is left to the implementation.
3.3 Service-ID
The InfiniBand specification defines a block of service IDs for
IETF use. The InfiniBand specification has left the definition
and management of this block to the IETF [IB_ARCH]. The 64-bit
block is:
+--------+--------+--------+--------+--------+--------+--------+--------+
|00000001|<--------------------------IETF use-------------------------->|
+--------+--------+--------+--------+--------+--------+--------+--------+
The Service-IDs used by IPoIB will be in the format:
+--------+--------+--------+--------+--------+--------+--------+--------+
|00000001|  Type  |         Reserved         |           QPN            |
+--------+--------+--------+--------+--------+--------+--------+--------+
The Reserved fields MUST be transmitted as zeroes. They are
ignored on reception.
The QPN MUST be the UD QP exchanged during address resolution.
The Type MUST be set to 0.
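A minimal sketch of forming the Service-ID from the UD QPN follows.
It assumes that the Type field occupies the second octet and that the
24-bit QPN occupies the three low-order octets, with the octets in
between reserved. The function names are hypothetical.

```python
IETF_PREFIX = 0x01  # first octet of the IETF service-ID block
IPOIB_TYPE = 0      # Type value required by this section


def ipoib_service_id(ud_qpn):
    """Return the 64-bit Service-ID for a given UD QPN."""
    # Assumed layout: octet 0 = prefix, octet 1 = Type,
    # octets 2-4 = Reserved (zero), octets 5-7 = 24-bit QPN.
    if not 0 <= ud_qpn < (1 << 24):
        raise ValueError("QPN must fit in 24 bits")
    return (IETF_PREFIX << 56) | (IPOIB_TYPE << 48) | ud_qpn


def service_id_qpn(service_id):
    """Extract the QPN; the reserved octets are ignored on reception."""
    return service_id & 0xFFFFFF
```

For example, a UD QPN of 0x000123 would give the Service-ID
0x0100000000000123 under these assumptions.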
4.0 Frame Format
All IP and ARP datagrams transported over InfiniBand are
prefixed by a 4-octet encapsulation header as described in
[IPoIB_UD].
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               |                               |
|             Type              |           Reserved            |
|                               |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The type field SHALL indicate the encapsulated protocol as per
the following table.
+----------+-------------+
| Type     | Protocol    |
|----------+-------------|
| 0x800    | IPv4        |
|----------+-------------|
| 0x806    | ARP         |
|----------+-------------|
| 0x8035   | RARP        |
|----------+-------------|
| 0x86DD   | IPv6        |
+----------+-------------+
These values are taken from the "ETHER TYPE" numbers assigned by
Internet Assigned Numbers Authority (IANA). Other network
protocols, identified by different values of "ETHER TYPE", may
use the encapsulation format defined herein but such use is
outside of the scope of this document.
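The encapsulation can be illustrated with a short sketch. It assumes
a 16-bit Type field followed by 16 reserved bits, which is consistent
with the 4-octet header and the 16-bit "ETHER TYPE" values above. The
function names are hypothetical.

```python
import struct

# Type values from the table above.
ETHERTYPES = {0x0800: "IPv4", 0x0806: "ARP", 0x8035: "RARP", 0x86DD: "IPv6"}


def encapsulate(ethertype, payload):
    """Prefix a payload with the 4-octet IPoIB encapsulation header."""
    # 16-bit Type in network byte order, then 16 reserved bits sent as zero.
    return struct.pack("!HH", ethertype, 0) + payload


def decapsulate(frame):
    """Split a frame into (type, payload); reserved bits are ignored."""
    if len(frame) < 4:
        raise ValueError("frame shorter than encapsulation header")
    ethertype, _reserved = struct.unpack("!HH", frame[:4])
    return ethertype, frame[4:]
```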
5.0 Maximum Transmission Unit
The IB connection setup might be used for both IPv4 and IPv6 or
it could be used for only one of them while a different
connection is used for the other. The link MTU MUST be able to
support the minimum MTU required by the protocols.
The default MTU of the IPoIB-CM interface is 2044 octets i.e.
2048 octet IPoIB-link MTU minus the 4 octet encapsulation
header.
However, connected modes of InfiniBand allow message sizes up to
2^31 octets. Therefore, IPoIB-CM can use a much larger MTU for
unicast communication between any two endpoints. The maximum
and/or optimal payload that can be received or sent over an
InfiniBand connection is dependent on the implementation, HCA
and the resources configured.
An implementation MAY utilise the following mechanism to
exchange the optimal message size across the IB connection.
5.1 Per-Connection MTU
Every IB connection setup message includes a "private data"
field [IB_ARCH]. The private data field MUST carry the following
information:
 0             15
+----------------+
|  Receive MTU   |
+----------------+
The sender of the connection setup message (CM REQ) MUST insert
the requested MTU in the "Receive MTU" field. This indicates the
maximum packet size the requester can accept. The requester MUST
be able to accept smaller MTU sizes as well.
It is up to the implementation to utilize this mechanism for
setting the per IB connection MTU. The IPoIB interface must
account for the 4-octet encapsulation header and so the IPoIB
MTU over the connection will be smaller by that amount.
This mechanism allows for different MTU values per peer. However,
to enable implementations to work with a single "connection" MTU,
a configuration parameter called "IPoIB-CM MTU multiplier" is
introduced. The default value of "IPoIB-CM MTU multiplier" is
1. The "Receive MTU" MUST NOT be set to less than "IPoIB-CM MTU
multiplier" times 2048.
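The arithmetic of this section can be sketched as follows. The sketch
assumes that the "Receive MTU" field is expressed in octets and is
carried in network byte order. Neither detail is stated above, and the
function names are hypothetical.

```python
import struct

ENCAP_HEADER = 4      # octets of IPoIB encapsulation header
BASE_LINK_MTU = 2048  # default IPoIB-link MTU in octets


def min_receive_mtu(multiplier=1):
    """Smallest permitted "Receive MTU" for a given multiplier."""
    return multiplier * BASE_LINK_MTU


def pack_receive_mtu(receive_mtu, multiplier=1):
    """Encode the 16-bit "Receive MTU" private data field."""
    if receive_mtu < min_receive_mtu(multiplier):
        raise ValueError("Receive MTU below multiplier * 2048")
    if receive_mtu > 0xFFFF:
        raise ValueError("Receive MTU does not fit the 16-bit field")
    # Assumption: the value is in octets, in network byte order.
    return struct.pack("!H", receive_mtu)


def ipoib_mtu(receive_mtu):
    """IPoIB MTU over the connection, after the 4-octet header."""
    return receive_mtu - ENCAP_HEADER
```

With the default multiplier of 1, the smallest "Receive MTU" is 2048
octets, which gives the default IPoIB MTU of 2044 octets.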
6.0 IPoIB-CM Considerations
Every IPoIB interface supports IPoIB-UD. It may additionally
support one or both of IPoIB-CM modes. Therefore, there can be
multiple methods of communicating between any two peers. This
implies that an interface MAY transmit/receive a packet over any
of the RC, UC or UD modes depending on the modes supported
between it and the peer. It further follows that every IPoIB
implementation compliant with this document MUST accept all
unicast transmissions over any of the IPoIB modes it supports.
Multicast and broadcast packets by their nature will always be
transmitted and received over the IPoIB-UD QP.
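One possible way for a sender to choose among the modes is sketched
below. The preference order (RC, then UC, then UD) and the fallback
to UD when no connection exists are assumptions made for illustration.
This document leaves the choice to the implementation and only
requires that multicast and broadcast use the UD QP.

```python
def select_mode(is_multicast, local_modes, peer_modes, connected):
    """Pick 'UD', 'RC' or 'UC' for one outgoing packet.

    local_modes and peer_modes are sets drawn from {'RC', 'UC'}.
    connected is the set of modes with an established IB connection.
    """
    if is_multicast:
        # Multicast and broadcast always use the IPoIB-UD QP.
        return "UD"
    usable = local_modes & peer_modes & connected
    for mode in ("RC", "UC"):  # assumed preference order
        if mode in usable:
            return mode
    # No usable connection: fall back to the always-supported UD mode.
    return "UD"
```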
6.1 A Cautionary Note on IPoIB-RC
The RC mode of InfiniBand guarantees in-order delivery of
packets. Every message transmitted over the RC connection is
broken into physical MTU sized packets by the RC connection. If
any packet is lost, it is retransmitted until the complete
message is exchanged. Therefore there is a possibility of a
reliable transport layer, such as TCP, retransmitting due to a
shorter timeout while the RC layer is still in the process of
transferring the complete message. A retransmission at the upper
layer will add to the already existing congestion.
Therefore, the RC timers as well as the maximum message size
supported at the IPoIB-RC connection must be set judiciously.
7.0 Security Considerations
A node may be returned a false set of flags by an impostor. This
may cause unnecessary attempts and some delay/disruption in
IPoIB communication. The same is the case if wrong/spurious QPN
values are provided during address resolution
broadcast/multicast.
8.0 IANA Considerations
This document requires that the reserved bits and octets be set
to zero on sends and ignored on receives. Proposed uses of the
reserved bits MUST be published as RFCs.
9.0 References
Normative
[IB_ARCH] InfiniBand Architecture Specification, version 1.1
www.infinibandta.org
[IPoIB_ARCH] draft-ietf-ipoib-architecture-04.txt, V. Kashyap
[IPoIB_UD] draft-ietf-ipoib-ip-over-infiniband-9.txt,
H.K. Jerry Chu, V. Kashyap
Author's Address
Vivek Kashyap
15350, SW Koll Parkway Beaverton, OR 97006
Phone: +1 503 578 3422 Email: vivk at us.ibm.com