[ofa-general] iWARP peer-to-peer CM proposal

Wed Nov 21 08:02:25 PST 2007

Group,

below is proposal on how to resolve peer-to-peer iWARP CM issue
discovered at interop event.

The main issue is that MPA spec (relevant portion of IETF RFC 5044 is
below) require that

connection initiator send first message over the established connection.

Multiple MPI implementations and several other apps use peer-to-peer
model.

So rather then forcing all of them to do it on their own, which will not
help with

interop between different implementations, the goal is to extend lower
layers to provide it.

Our first idea was to leave MPA protocol untouched and try to solve this
problem

in iw_cm. But there are too many complications to it. First, in order to
adhere to RFC5044

initiator must send first FPDU and responder process it. But since the
connection is already

established processing FPDU involves ULP on whose behalf the connection
is created.

So either initiator sends a message which generates completion on
responder CQ, thus visible

to ULP, or not. In the later case, the only op which can do it is RDMA
one, which means

that responder somehow provided initiator S-tag which it can use. So,
this is an extension

to MPA, probably using private data. And that responder upon receiving
it destroy this S-tag.

In any case this is an extension of MPA. 

In the former, Send is used but this requires a buffer to be posted to
CQ. But since

the same CQ (or SharedCQ) can be used by other connections at the same
time it can cause

the responder CM posted buffer to be consumed by other connection. This
is not acceptable.

So new we consider extension to MPA protocol.

The goal is to be completely backwards compatible to existing version 1.

In a nutshell, use a "flag" in the MPA request message which indicates
that

"ready to receive" message will be send by requestor upon receiving

MPA response message with connection acceptance.

here are the changes to IETF RFC5044

1.      0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   0  |                                                               |
      +         Key (16 bytes containing "MPA ID Req Frame")          +
   4  |      (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65)        |
      +         Or  (16 bytes containing "MPA ID Rep Frame")          +
   8  |      (4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65)        |
      +         Or  (16 bytes containing "MPA ID Rtr Frame")          +
   12 |      (4D 50 41 20 49 44 20 52 74 52 20 46 72 61 6D 65)        |
+
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   16 |M|C|R|S| Res   |     Rev       |          PD_Length            |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                                                               |
      ~                                                               ~
      ~                   Private Data                                ~
      |                                                               |
      |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

2. S: indicator in the Req frame whether or not Requestor will send Rtr
frame.

      In Req frame, if set to 1 then Rtr frame will be sent if responder

    sends Rep frame with accept bit set. 0 indicate that Rtr frame

    will not be sent.

    In Rep frame, 0 means that Responder cannot support Rtr frame,

    while 1 that it is and is waiting for it.

     (While my preference is to handle this as MPA protocol version
matching rules,

    proposed method will provide complete backwards compatibility) 

    Unused by Rtr frame. That is set to 0 in Rtr frame and ignored

    by responder.

    All other bits M,C,R and remainder of Res treated as in MPA ver 1.

    Rtr frame adhere to C bit as specified in Rep frame

3.  No private data format is defined for Rtr in this version.

4. Example will be added to present Rtr model.

 That is if S bit is not set the current MPA ver 1 model is followed.

And if S bit is set then "proposed" model with Rtr message is followed.

5. Requestor use of Rtr frame must adhere to S bit setting of Rep frame.

**************************************************

While the process of driving this proposal thru IETF is very very
length,

in order to solve this problem now, we can still use this proposal with
the

current version 1 of MPA. All existing implementation will still work.

And if both sides support this change than peer-to-peer model is also
provided.

Comments, suggestion, critics requested.

I am especially want to know if we are missing some gotch you

which was discussed by RDDP WG when they rejected peer-to-peer model for
MPA.

iWARP vendors, please comment on the feasibility of implementing this
MPA

extension.

*********************************************************************

7.  Connection Semantics

7.1.  Connection Setup

   MPA requires that the Consumer MUST activate MPA, and any TCP
   enhancements for MPA, on a TCP half connection at the same location
   in the octet stream at both the sender and the receiver.  This is
   required in order for the Marker scheme to correctly locate the
   Markers (if enabled) and to correctly locate the first FPDU.

   MPA, and any TCP enhancements for MPA are enabled by the ULP in both
   directions at once at an endpoint.

Culley, et al.              Standards Track                    [Page 24]
  <http://tools.ietf.org/html/rfc5044#page-25> 
RFC 5044 <http://tools.ietf.org/html/rfc5044>                   MPA
Framing for TCP               October 2007

   This can be accomplished several ways, and is left up to DDP's ULP:

   *   DDP's ULP MAY require DDP on MPA startup immediately after TCP
       connection setup.  This has the advantage that no streaming mode
       negotiation is needed.  An example of such a protocol is shown in
       Figure 10: Example Immediate Startup negotiation.

       This may be accomplished by using a well-known port, or a service
       locator protocol to locate an appropriate port on which DDP on
       MPA is expected to operate.

   *   DDP's ULP MAY negotiate the start of DDP on MPA sometime after a
       normal TCP startup, using TCP streaming data exchanges on the
       same connection.  The exchange establishes that DDP on MPA (as
       well as other ULPs) will be used, and exactly locates the point
       in the octet stream where MPA is to begin operation.  Note that
       such a negotiation protocol is outside the scope of this
       specification.  A simplified example of such a protocol is shown
       in Figure 9: Example Delayed Startup negotiation on page 33.

   An MPA endpoint operates in two distinct phases.

   The Startup Phase is used to verify correct MPA setup, exchange CRC
   and Marker configuration, and optionally pass Private Data between
   endpoints prior to completing a DDP connection.  During this phase,
   specifically formatted frames are exchanged as TCP byte streams
   without using CRCs or Markers.  During this phase a DDP endpoint need
   not be "bound" to the MPA connection.  In fact, the choice of DDP
   endpoint and its operating parameters may not be known until the
   Consumer supplied Private Data (if any) has been examined by the
   Consumer.

   The second distinct phase is Full Operation during which FPDUs are
   sent using all the rules that pertain (CRCs, Markers, MULPDU
   restrictions, etc.).  A DDP endpoint MUST be "bound" to the MPA
   connection at entry to this phase.

   When Private Data is passed between ULPs in the Startup Phase, the
   ULP is responsible for interpreting that data, and then placing MPA
   into Full Operation.

   Note: The following text differentiates the two endpoints by calling
       them Initiator and Responder.  This is quite arbitrary and is NOT
       related to the TCP startup (SYN, SYN/ACK sequence).  The
       Initiator is the side that sends first in the MPA startup
       sequence (the MPA Request Frame).

Culley, et al.              Standards Track                    [Page 25]
  <http://tools.ietf.org/html/rfc5044#page-26> 
RFC 5044 <http://tools.ietf.org/html/rfc5044>                   MPA
Framing for TCP               October 2007

   Note: The possibility that both endpoints would be allowed to make a
       connection at the same time, sometimes called an active/active
       connection, was considered by the work group and rejected.  There
       were several motivations for this decision.  One was that
       applications needing this facility were few (none other than
       theoretical at the time of this document).  Another was that the
       facility created some implementation difficulties, particularly
       with the "dual stack" designs described later on.  A last issue
       was that dealing with rejected connections at startup would have
       required at least an additional frame type, and more recovery
       actions, complicating the protocol.  While none of these issues
       was overwhelming, the group and implementers were not motivated
       to do the work to resolve these issues.  The protocol includes a
       method of detecting these active/active startup attempts so that
       they can be rejected and an error reported.

   The ULP is responsible for determining which side is Initiator or
   Responder.  For client/server type ULPs, this is easy.  For peer-peer
   ULPs (which might utilize a TCP style active/active startup), some
   mechanism (not defined by this specification) must be established, or
   some streaming mode data exchanged prior to MPA startup to determine
   which side starts in Initiator and which starts in Responder MPA
   mode.

7.1.1  MPA Request and Reply Frame Format

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   0  |                                                               |
      +         Key (16 bytes containing "MPA ID Req Frame")          +
   4  |      (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65)        |
      +         Or  (16 bytes containing "MPA ID Rep Frame")          +
   8  |      (4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65)        |
      +                                                               +
   12 |                                                               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   16 |M|C|R| Res     |     Rev       |          PD_Length            |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                                                               |
      ~                                                               ~
      ~                   Private Data                                ~
      |                                                               |
      |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                     Figure 8: MPA Request/Reply Frame

Culley, et al.              Standards Track                    [Page 26]
  <http://tools.ietf.org/html/rfc5044#page-27> 
RFC 5044 <http://tools.ietf.org/html/rfc5044>                   MPA
Framing for TCP               October 2007

   Key: This field contains the "key" used to validate that the sender
       is an MPA sender.  Initiator mode senders MUST set this field to
       the fixed value "MPA ID Req Frame" or (in byte order) 4D 50 41 20
       49 44 20 52 65 71 20 46 72 61 6D 65 (in hexadecimal).  Responder
       mode receivers MUST check this field for the same value, and
       close the connection and report an error locally if any other
       value is detected.  Responder mode senders MUST set this field to
       the fixed value "MPA ID Rep Frame" or (in byte order) 4D 50 41 20
       49 44 20 52 65 70 20 46 72 61 6D 65 (in hexadecimal).  Initiator
       mode receivers MUST check this field for the same value, and
       close the connection and report an error locally if any other
       value is detected.

   M: This bit declares an endpoint's REQUIRED Marker usage.  When this
       bit is '1' in an MPA Request Frame, the Initiator declares that
       Markers are REQUIRED in FPDUs sent from the Responder.  When set
       to '1' in an MPA Reply Frame, this bit declares that Markers are
       REQUIRED in FPDUs sent from the Initiator.  When in a received
       MPA Request Frame or MPA Reply Frame and the value is '0',
       Markers MUST NOT be added to the data stream by that endpoint.
       When '1' Markers MUST be added as described in Section 4.3
<http://tools.ietf.org/html/rfc5044#section-4.3> , MPA
       Markers.

   C: This bit declares an endpoint's preferred CRC usage.  When this
       field is '0' in the MPA Request Frame and the MPA Reply Frame,
       CRCs MUST not be checked and need not be generated by either
       endpoint.  When this bit is '1' in either the MPA Request Frame
       or MPA Reply Frame, CRCs MUST be generated and checked by both
       endpoints.  Note that even when not in use, the CRC field remains
       present in the FPDU.  When CRCs are not in use, the CRC field
       MUST be considered valid for FPDU checking regardless of its
       contents.

   R: This bit is set to zero, and not checked on reception in the MPA
       Request Frame.  In the MPA Reply Frame, this bit is the Rejected
       Connection bit, set by the Responders ULP to indicate acceptance
       '0', or rejection '1', of the connection parameters provided in
       the Private Data.

   Res: This field is reserved for future use.  It MUST be set to zero
       when sending, and not checked on reception.

Culley, et al.              Standards Track                    [Page 27]
  <http://tools.ietf.org/html/rfc5044#page-28> 
RFC 5044 <http://tools.ietf.org/html/rfc5044>                   MPA
Framing for TCP               October 2007

   Rev: This field contains the revision of MPA.  For this version of
       the specification, senders MUST set this field to one.  MPA
       receivers compliant with this version of the specification MUST
       check this field.  If the MPA receiver cannot interoperate with
       the received version, then it MUST close the connection and
       report an error locally.  Otherwise, the MPA receiver should
       report the received version to the ULP.

   PD_Length: This field MUST contain the length in octets of the
       Private Data field.  A value of zero indicates that there is no
       Private Data field present at all.  If the receiver detects that
       the PD_Length field does not match the length of the Private Data
       field, or if the length of the Private Data field exceeds 512
       octets, the receiver MUST close the connection and report an
       error locally.  Otherwise, the MPA receiver should pass the
       PD_Length value and Private Data to the ULP.

   Private Data: This field may contain any value defined by ULPs or may
       not be present.  The Private Data field MUST be between 0 and 512
       octets in length.  ULPs define how to size, set, and validate
       this field within these limits.  Private Data usage is further
       discussed in Section 7.1.4
<http://tools.ietf.org/html/rfc5044#section-7.1.4> .

7.1.2.  Connection Startup Rules

   The following rules apply to MPA connection Startup Phase:

   1.  When MPA is started in the Initiator mode, the MPA implementation
       MUST send a valid MPA Request Frame.  The MPA Request Frame MAY
       include ULP-supplied Private Data.

   2.  When MPA is started in the Responder mode, the MPA implementation
       MUST wait until an MPA Request Frame is received and validated
       before entering Full MPA/DDP Operation.

       If the MPA Request Frame is improperly formatted, the
       implementation MUST close the TCP connection and exit MPA.

       If the MPA Request Frame is properly formatted but the Private
       Data is not acceptable, the implementation SHOULD return an MPA
       Reply Frame with the Rejected Connection bit set to '1'; the MPA
       Reply Frame MAY include ULP-supplied Private Data; the
       implementation MUST exit MPA, leaving the TCP connection open.
       The ULP may close TCP or use the connection for other purposes.

       If the MPA Request Frame is properly formatted and the Private
       Data is acceptable, the implementation SHOULD return an MPA Reply
       Frame with the Rejected Connection bit set to '0'; the MPA Reply

Culley, et al.              Standards Track                    [Page 28]
  <http://tools.ietf.org/html/rfc5044#page-29> 
RFC 5044 <http://tools.ietf.org/html/rfc5044>                   MPA
Framing for TCP               October 2007

       Frame MAY include ULP-supplied Private Data; and the Responder
       SHOULD prepare to interpret any data received as FPDUs and pass
       any received ULPDUs to DDP.

       Note: Since the receiver's ability to deal with Markers is
           unknown until the Request and Reply Frames have been
           received, sending FPDUs before this occurs is not possible.

       Note: The requirement to wait on a Request Frame before sending a
           Reply Frame is a design choice.  It makes for a well-ordered
           sequence of events at each end, and avoids having to specify
           how to deal with situations where both ends start at the same
           time.

   3.  MPA Initiator mode implementations MUST receive and validate an
       MPA Reply Frame.

       If the MPA Reply Frame is improperly formatted, the
       implementation MUST close the TCP connection and exit MPA.

       If the MPA Reply Frame is properly formatted but is the Private
       Data is not acceptable, or if the Rejected Connection bit is set
       to '1', the implementation MUST exit MPA, leaving the TCP
       connection open.  The ULP may close TCP or use the connection for
       other purposes.

       If the MPA Reply Frame is properly formatted and the Private Data
       is acceptable, and the Reject Connection bit is set to '0', the
       implementation SHOULD enter Full MPA/DDP Operation Phase;
       interpreting any received data as FPDUs and sending DDP ULPDUs as
       FPDUs.

   4.  MPA Responder mode implementations MUST receive and validate at
       least one FPDU before sending any FPDUs or Markers.

       Note: This requirement is present to allow the Initiator time to
           get its receiver into Full Operation before an FPDU arrives,
           avoiding potential race conditions at the Initiator.  This
           was also subject to some debate in the work group before
           rough consensus was reached.  Eliminating this requirement
           would allow faster startup in some types of applications.
           However, that would also make certain implementations
           (particularly "dual stack") much harder.

   5.  If a received "Key" does not match the expected value (see
       Section 7.1.1 <http://tools.ietf.org/html/rfc5044#section-7.1.1>
, MPA Request and Reply Frame Format) the TCP/DDP
       connection MUST be closed, and an error returned to the ULP.

Culley, et al.              Standards Track                    [Page 29]
  <http://tools.ietf.org/html/rfc5044#page-30> 
RFC 5044 <http://tools.ietf.org/html/rfc5044>                   MPA
Framing for TCP               October 2007

   6.  The received Private Data fields may be used by Consumers at
       either end to further validate the connection and set up DDP or
       other ULP parameters.  The Initiator ULP MAY close the
       TCP/MPA/DDP connection as a result of validating the Private Data
       fields.  The Responder SHOULD return an MPA Reply Frame with the
       "Reject Connection" bit set to '1' if the validation of the
       Private Data is not acceptable to the ULP.

   7.  When the first FPDU is to be sent, then if Markers are enabled,
       the first octets sent are the special Marker 0x00000000, followed
       by the start of the FPDU (the FPDU's ULPDU Length field).  If
       Markers are not enabled, the first octets sent are the start of
       the FPDU (the FPDU's ULPDU Length field).

   8.  MPA implementations MUST use the difference between the MPA
       Request Frame and the MPA Reply Frame to check for incorrect
       "Initiator/Initiator" startups.  Implementations SHOULD put a
       timeout on waiting for the MPA Request Frame when started in
       Responder mode, to detect incorrect "Responder/Responder"
       startups.

   9.  MPA implementations MUST validate the PD_Length field.  The
       buffer that receives the Private Data field MUST be large enough
       to receive that data; the amount of Private Data MUST not exceed
       the PD_Length or the application buffer.  If any of the above
       fails, the startup frame MUST be considered improperly formatted.

   10. MPA implementations SHOULD implement a reasonable timeout while
       waiting for the entire set of startup frames; this prevents
       certain denial-of-service attacks.  ULPs SHOULD implement a
       reasonable timeout while waiting for FPDUs, ULPDUs, and
       application level messages to guard against application failures
       and certain denial-of-service attacks.

7.1.3.  Example Delayed Startup Sequence

   A variety of startup sequences are possible when using MPA on TCP.
   Following is an example of an MPA/DDP startup that occurs after TCP
   has been running for a while and has exchanged some amount of
   streaming data.  This example does not use any Private Data (an
   example that does is shown later in Section 7.1.4.2
<http://tools.ietf.org/html/rfc5044#section-7.1.4.2> , Example
   Immediate Startup Using Private Data), although it is perfectly legal
   to include the Private Data.  Note that since the example does not
   use any Private Data, there are no ULP interactions shown between
   receiving "startup frames" and putting MPA into Full Operation.

Culley, et al.              Standards Track                    [Page 30]
  <http://tools.ietf.org/html/rfc5044#page-31> 
RFC 5044 <http://tools.ietf.org/html/rfc5044>                   MPA
Framing for TCP               October 2007

         Initiator                                 Responder

  +---------------------------+
  |ULP streaming mode         |
  |  <Hello> request to       |
  |  transition to DDP/MPA    |           +---------------------------+
  |  mode (optional).         | --------> |ULP gets request;          |
  +---------------------------+           |  enables MPA Responder    |
                                          |  mode with last (optional)|
                                          |  streaming mode           |
                                          |  <Hello Ack> for MPA to   |
                                          |  send.                    |
  +---------------------------+           |MPA waits for incoming     |
  |ULP receives streaming     | <-------- |  <MPA Request Frame>.     |
  |  <Hello Ack>;             |           +---------------------------+
  |Enters MPA Initiator mode; |
  |MPA sends                  |
  |  <MPA Request Frame>;     |
  |MPA waits for incoming     |           +---------------------------+
  |  <MPA Reply Frame>.       | - - - - > |MPA receives               |
  +---------------------------+           |  <MPA Request Frame>.     |
                                          |Consumer binds DDP to MPA; |
                                          |MPA sends the              |
                                          |  <MPA Reply Frame>.       |
                                          |DDP/MPA enables FPDU       |
  +---------------------------+           |  decoding, but does not   |
  |MPA receives the           | < - - - - |  send any FPDUs.          |
  |  <MPA Reply Frame>        |           +---------------------------+
  |Consumer binds DDP to MPA; |
  |DDP/MPA begins Full        |
  |  Operation.               |
  |MPA sends first FPDU (as   |           +---------------------------+
  |  DDP ULPDUs become        | ========> |MPA receives first FPDU.   |
  |  available).              |           |MPA sends first FPDU (as   |
  +---------------------------+           |  DDP ULPDUs become        |
                                  <====== |  available).              |
                                          +---------------------------+

              Figure 9: Example Delayed Startup Negotiation

Culley, et al.              Standards Track                    [Page 31]
  <http://tools.ietf.org/html/rfc5044#page-32> 
RFC 5044 <http://tools.ietf.org/html/rfc5044>                   MPA
Framing for TCP               October 2007

   An example Delayed Startup sequence is described below:

       *   Active and passive sides start up a TCP connection in the
           usual fashion, probably using sockets APIs.  They exchange
           some amount of streaming mode data.  At some point, one side
           (the MPA Initiator) sends streaming mode data that
           effectively says "Hello, let's go into MPA/DDP mode".

   *   When the remote side (the MPA Responder) gets this streaming mode
       message, the Consumer would send a last streaming mode message
       that effectively says "I acknowledge your Hello, and am now in
       MPA Responder mode".  The exchange of these messages establishes
       the exact point in the TCP stream where MPA is enabled.  The
       Responding Consumer enables MPA in the Responder mode and waits
       for the initial MPA startup message.

       *   The Initiating Consumer would enable MPA startup in the
           Initiator mode which then sends the MPA Request Frame.  It is
           assumed that no Private Data messages are needed for this
           example, although it is possible to do so.  The Initiating
           MPA (and Consumer) would also wait for the MPA connection to
           be accepted.

   *   The Responding MPA would receive the initial MPA Request Frame
       and would inform the Consumer that this message arrived.  The
       Consumer can then accept the MPA/DDP connection or close the TCP
       connection.

   *   To accept the connection request, the Responding Consumer would
       use an appropriate API to bind the TCP/MPA connections to a DDP
       endpoint, thus enabling MPA/DDP into Full Operation.  In the
       process of going to Full Operation, MPA sends the MPA Reply
       Frame.  MPA/DDP waits for the first incoming FPDU before sending
       any FPDUs.

   *   If the initial TCP data was not a properly formatted MPA Request
       Frame, MPA will close or reset the TCP connection immediately.

       *   The Initiating MPA would receive the MPA Reply Frame and
           would report this message to the Consumer.  The Consumer can
           then accept the MPA/DDP connection, or close or reset the TCP
           connection to abort the process.

       *   On determining that the connection is acceptable, the
           Initiating Consumer would use an appropriate API to bind the
           TCP/MPA connections to a DDP endpoint thus enabling MPA/DDP
           into Full Operation.  MPA/DDP would begin sending DDP
           messages as MPA FPDUs.

Culley, et al.              Standards Track                    [Page 32]
  <http://tools.ietf.org/html/rfc5044#page-33> 
RFC 5044 <http://tools.ietf.org/html/rfc5044>                   MPA
Framing for TCP               October 2007

7.1.4.  Use of Private Data

   This section is advisory in nature, in that it suggests a method by
   which a ULP can deal with pre-DDP connection information exchange.

7.1.4.1.  Motivation

   Prior RDMA protocols have been developed that provide Private Data
   via out-of-band mechanisms.  As a result, many applications now
   expect some form of Private Data to be available for application use
   prior to setting up the DDP/RDMA connection.  Following are some
   examples of the use of Private Data.

   An RDMA endpoint (referred to as a Queue Pair, or QP, in InfiniBand
   and the [VERBS-RDMA
<http://tools.ietf.org/html/rfc5044#ref-VERBS-RDMA> ]) must be
associated with a Protection Domain.
   No receive operations may be posted to the endpoint before it is
   associated with a Protection Domain.  Indeed under both the
   InfiniBand and proposed RDMA/DDP verbs [VERBS-RDMA
<http://tools.ietf.org/html/rfc5044#ref-VERBS-RDMA> ] an endpoint/QP is
   created within a Protection Domain.

   There are some applications where the choice of Protection Domain is
   dependent upon the identity of the remote ULP client.  For example,
   if a user session requires multiple connections, it is highly
   desirable for all of those connections to use a single Protection
   Domain.  Note: Use of Protection Domains is further discussed in
   [RDMASEC <http://tools.ietf.org/html/rfc5044#ref-RDMASEC> ].

   InfiniBand, the DAT APIs [DAT-API
<http://tools.ietf.org/html/rfc5044#ref-DAT-API> ], and the IT-API
[IT-API <http://tools.ietf.org/html/rfc5044#ref-IT-API> ] all
   provide for the active-side ULP to provide Private Data when
   requesting a connection.  This data is passed to the ULP to allow it
   to determine whether to accept the connection, and if so with which
   endpoint (and implicitly which Protection Domain).

   The Private Data can also be used to ensure that both ends of the
   connection have configured their RDMA endpoints compatibly on such
   matters as the RDMA Read capacity (see [RDMAP
<http://tools.ietf.org/html/rfc5044#ref-RDMAP> ]).  Further ULP-
   specific uses are also presumed, such as establishing the identity of
   the client.

   Private Data is also allowed for when accepting the connection, to
   allow completion of any negotiation on RDMA resources and for other
   ULP reasons.

   There are several potential ways to exchange this Private Data.  For
   example, the InfiniBand specification includes a connection
   management protocol that allows a small amount of Private Data to be
   exchanged using datagrams before actually starting the RDMA
   connection.

Culley, et al.              Standards Track                    [Page 33]
  <http://tools.ietf.org/html/rfc5044#page-34> 
RFC 5044 <http://tools.ietf.org/html/rfc5044>                   MPA
Framing for TCP               October 2007

   This document allows for small amounts of Private Data to be
   exchanged as part of the MPA startup sequence.  The actual Private
   Data fields are carried in the MPA Request Frame and the MPA Reply
   Frame.

   If larger amounts of Private Data or more negotiation is necessary,
   TCP streaming mode messages may be exchanged prior to enabling MPA.

Culley, et al.              Standards Track                    [Page 34]
  <http://tools.ietf.org/html/rfc5044#page-35> 
RFC 5044 <http://tools.ietf.org/html/rfc5044>                   MPA
Framing for TCP               October 2007

7.1.4.2.  Example Immediate Startup Using Private Data

          Initiator                                 Responder

   +---------------------------+
   |TCP SYN sent.              |           +--------------------------+
   +---------------------------+ --------> |TCP gets SYN packet;      |
   +---------------------------+           |  sends SYN-Ack.          |
   |TCP gets SYN-Ack           | <-------- +--------------------------+
   |  sends Ack.               |
   +---------------------------+ --------> +--------------------------+
   +---------------------------+           |Consumer enables MPA      |
   |Consumer enables MPA       |           |Responder mode, waits for |
   |Initiator mode with        |           |  <MPA Request frame>.    |
   |Private Data; MPA sends    |           +--------------------------+
   |  <MPA Request Frame>;     |
   |MPA waits for incoming     |           +--------------------------+
   |  <MPA Reply Frame>.       | - - - - > |MPA receives              |
   +---------------------------+           |  <MPA Request Frame>.    |
                                           |Consumer examines Private |
                                           |Data, provides MPA with   |
                                           |return Private Data,      |
                                           |binds DDP to MPA, and     |
                                           |enables MPA to send an    |
                                           |  <MPA Reply Frame>.      |
                                           |DDP/MPA enables FPDU      |
   +---------------------------+           |decoding, but does not    |
   |MPA receives the           | < - - - - |send any FPDUs.           |
   |  <MPA Reply Frame>.       |           +--------------------------+
   |Consumer examines Private  |
   |Data, binds DDP to MPA,    |
   |and enables DDP/MPA to     |
   |begin Full Operation.      |
   |MPA sends first FPDU (as   |           +--------------------------+
   |DDP ULPDUs become          | ========> |MPA receives first FPDU.  |
   |available).                |           |MPA sends first FPDU (as  |
   +---------------------------+           |DDP ULPDUs become         |
                                   <====== |available).               |
                                           +--------------------------+

             Figure 10: Example Immediate Startup Negotiation

   Note: The exact order of when MPA is started in the TCP connection
       sequence is implementation dependent; the above diagram shows one
       possible sequence.  Also, the Initiator "Ack" to the Responder's
       "SYN-Ack" may be combined into the same TCP segment containing
       the MPA Request Frame (as is allowed by TCP RFCs).

Culley, et al.              Standards Track                    [Page 35]
  <http://tools.ietf.org/html/rfc5044#page-36> 
RFC 5044 <http://tools.ietf.org/html/rfc5044>                   MPA
Framing for TCP               October 2007

   The example immediate startup sequence is described below:

   *   The passive side (Responding Consumer) would listen on the TCP
       destination port, to indicate its readiness to accept a
       connection.

       *   The active side (Initiating Consumer) would request a
           connection from a TCP endpoint (that expected to upgrade to
           MPA/DDP/RDMA and expected the Private Data) to a destination
           address and port.

       *   The Initiating Consumer would initiate a TCP connection to
           the destination port.  Acceptance/rejection of the connection
           would proceed as per normal TCP connection establishment.

   *   The passive side (Responding Consumer) would receive the TCP
       connection request as usual allowing normal TCP gatekeepers, such
       as INETD and TCPserver, to exercise their normal
       safeguard/logging functions.  On acceptance of the TCP
       connection, the Responding Consumer would enable MPA in the
       Responder mode and wait for the initial MPA startup message.

       *   The Initiating Consumer would enable MPA startup in the
           Initiator mode to send an initial MPA Request Frame with its
           included Private Data message to send.  The Initiating MPA
           (and Consumer) would also wait for the MPA connection to be
           accepted, and any returned Private Data.

   *   The Responding MPA would receive the initial MPA Request Frame
       with the Private Data message and would pass the Private Data
       through to the Consumer.  The Consumer can then accept the
       MPA/DDP connection, close the TCP connection, or reject the MPA
       connection with a return message.

   *   To accept the connection request, the Responding Consumer would
       use an appropriate API to bind the TCP/MPA connections to a DDP
       endpoint, thus enabling MPA/DDP into Full Operation.  In the
       process of going to Full Operation, MPA sends the MPA Reply
       Frame, which includes the Consumer-supplied Private Data
       containing any appropriate Consumer response.  MPA/DDP waits for
       the first incoming FPDU before sending any FPDUs.

   *   If the initial TCP data was not a properly formatted MPA Request
       Frame, MPA will close or reset the TCP connection immediately.

Culley, et al.              Standards Track                    [Page 36]
  <http://tools.ietf.org/html/rfc5044#page-37> 
RFC 5044 <http://tools.ietf.org/html/rfc5044>                   MPA
Framing for TCP               October 2007

   *   To reject the MPA connection request, the Responding Consumer
       would send an MPA Reply Frame with any ULP-supplied Private Data
       (with reason for rejection), with the "Rejected Connection" bit
       set to '1', and may close the TCP connection.

       *   The Initiating MPA would receive the MPA Reply Frame with the
           Private Data message and would report this message to the
           Consumer, including the supplied Private Data.

           If the "Rejected Connection" bit is set to a '1', MPA will
           close the TCP connection and exit.

           If the "Rejected Connection" bit is set to a '0', and on
           determining from the MPA Reply Frame Private Data that the
           connection is acceptable, the Initiating Consumer would use
           an appropriate API to bind the TCP/MPA connections to a DDP
           endpoint thus enabling MPA/DDP into Full Operation.  MPA/DDP
           would begin sending DDP messages as MPA FPDUs.

7.1.5.  "Dual Stack" Implementations

   MPA/DDP implementations are commonly expected to be implemented as
   part of a "dual stack" architecture.  One stack is the traditional
   TCP stack, usually with a sockets interface API (Application
   Programming Interface).  The second stack is the MPA/DDP stack with
   its own API, and potentially separate code or hardware to deal with
   the MPA/DDP data.  Of course, implementations may vary, so the
   following comments are of an advisory nature only.

   The use of the two stacks offers advantages:

       TCP connection setup is usually done with the TCP stack.  This
       allows use of the usual naming and addressing mechanisms.  It
       also means that any mechanisms used to "harden" the connection
       setup against security threats are also used when starting
       MPA/DDP.

       Some applications may have been originally designed for TCP, but
       are "enhanced" to utilize MPA/DDP after a negotiation reveals the
       capability to do so.  The negotiation process takes place in
       TCP's streaming mode, using the usual TCP APIs.

       Some new applications, designed for RDMA or DDP, still need to
       exchange some data prior to starting MPA/DDP.  This exchange can
       be of arbitrary length or complexity, but often consists of only
       a small amount of Private Data, perhaps only a single message.
       Using the TCP streaming mode for this exchange allows this to be
       done using well-understood methods.

Culley, et al.              Standards Track                    [Page 37]
  <http://tools.ietf.org/html/rfc5044#page-38> 
RFC 5044 <http://tools.ietf.org/html/rfc5044>                   MPA
Framing for TCP               October 2007

   The main disadvantage of using two stacks is the conversion of an
   active TCP connection between them.  This process must be done with
   care to prevent loss of data.

   To avoid some of the problems when using a "dual stack" architecture,
   the following additional restrictions may be required by the
   implementation:

   1.  Enabling the DDP/MPA stack SHOULD be done only when no incoming
       stream data is expected.  This is typically managed by the ULP
       protocol.  When following the recommended startup sequence, the
       Responder side enters DDP/MPA mode, sends the last streaming mode
       data, and then waits for the MPA Request Frame.  No additional
       streaming mode data is expected.  The Initiator side ULP receives
       the last streaming mode data, and then enters DDP/MPA mode.
       Again, no additional streaming mode data is expected.

   2.  The DDP/MPA MAY provide the ability to send a "last streaming
       message" as part of its Responder DDP/MPA enable function.  This
       allows the DDP/MPA stack to more easily manage the conversion to
       DDP/MPA mode (and avoid problems with a very fast return of the
       MPA Request Frame from the Initiator side).

   Note: Regardless of the "stack" architecture used, TCP's rules MUST
       be followed.  For example, if network data is lost, re-segmented,
       or re-ordered, TCP MUST recover appropriately even when this
       occurs while switching stacks.

Arkady Kanevsky                       email: arkady at netapp.com

Network Appliance Inc.               phone: 781-768-5395

1601 Trapelo Rd. - Suite 16.        Fax: 781-895-1195

Waltham, MA 02451                   central phone: 781-768-5300

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20071121/a631e5ff/attachment.html>