[ofa-general] Re: iWARP peer-to-peer CM proposal

Wed Nov 21 15:30:29 PST 2007

Very good points.
Thanks Steve.

If we can do an unsignalled 0-size RDMA Read with a "bogus" S-tag, this
may work better.
Yes, it will require IRD to be set to a nonzero value at the Responder.
Ditto ORD of at least 1 on Responder.
There is no need for an extra CQ entry on either side for it.
One is only needed for the error path.
So it would only be needed if the Sender had posted a full queue of sends.
But it cannot post anything, because the CM will not yet have let it
know that the connection is established.

Happy Thanksgiving,

Arkady Kanevsky                       email: arkady at netapp.com
Network Appliance Inc.               phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.        Fax: 781-895-1195
Waltham, MA 02451                   central phone: 781-768-5300

> -----Original Message-----
> From: Steve Wise [mailto:swise at opengridcomputing.com] 
> Sent: Wednesday, November 21, 2007 1:07 PM
> To: Kanevsky, Arkady
> Cc: Glenn Grundstrom; Leonid Grossman; openib-general at openib.org
> Subject: [ofa-general] Re: iWARP peer-to-peer CM proposal
> 
> Comments in-line below...
> 
> 
> Kanevsky, Arkady wrote:
> > 
> >     Group,
> > 
> >     below is a proposal on how to resolve the peer-to-peer iWARP CM
> >     issue discovered at the interop event.
> > 
> >     The main issue is that the MPA spec (the relevant portion of IETF
> >     RFC 5044 is below) requires that the connection initiator send
> >     the first message over the established connection. Multiple MPI
> >     implementations and several other apps use a peer-to-peer model.
> >     So rather than forcing all of them to do it on their own, which
> >     will not help with interop between different implementations,
> >     the goal is to extend the lower layers to provide it.
> > 
> >     Our first idea was to leave the MPA protocol untouched and try to
> >     solve this problem in iw_cm. But there are too many complications
> >     to it. First, in order to adhere to RFC 5044 the initiator must
> >     send the first FPDU and the responder must process it. But since
> >     the connection is already established, processing the FPDU
> >     involves the ULP on whose behalf the connection is created.
> >     So either the initiator sends a message which generates a
> >     completion on the responder CQ, thus visible to the ULP, or not.
> 
> 
> 
> > In the latter case, the only op which can do it is an RDMA one,
> >     which means that the responder somehow provided the initiator
> >     with an S-tag which it can use. So, this is an extension to MPA,
> >     probably using private data, with the responder destroying this
> >     S-tag upon receiving the op.
> > 
> >     In any case this is an extension of MPA.
> > 
> 
> 
> This stag exchange isn't needed if this RDMA op is a 0B READ.
> The responder waits for that 0B read and only indicates the
> rdma connection is established to its ULP when it replies to
> the 0B read.  In this scenario, the responder/server side
> doesn't consume any CQ resources.
> But it would require an IRD of at least 1 to be configured on the QP.
> The initiator still requires an SQ entry, and possibly a CQ
> entry, for initiating the 0B read and handling completion.
> But it's perhaps a little less painful than doing a SEND/RECV
> exchange.  The read wr could be unsignaled so that it won't
> generate a CQE.  But it still consumes an SQ WR slot so the
> SQ would have to be sized to allow this extra WR. And I guess
> the CQ would also need to be sized accordingly in case the
> read failed.
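
To make the resource accounting above concrete, here is a rough Python
sketch of how a CM might pad the sizes a ULP asked for so that one
hidden, unsignaled 0B read fits. QpSizes and pad_for_hidden_read are
hypothetical names used only for illustration, not part of any verbs
API, and outbound read depth is deliberately left out of the model.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class QpSizes:
    """Sizes requested by the ULP (hypothetical container)."""
    sq_depth: int  # send queue WR slots
    cq_depth: int  # completion queue entries
    ird: int       # inbound RDMA Read depth


def pad_for_hidden_read(ulp: QpSizes, is_initiator: bool) -> QpSizes:
    """Return sizes with room for one CM-owned 0B read."""
    if is_initiator:
        # The read consumes one SQ slot even when unsignaled, and a
        # CQ entry is reserved in case the read completes in error.
        return replace(ulp,
                       sq_depth=ulp.sq_depth + 1,
                       cq_depth=ulp.cq_depth + 1)
    # The responder uses no queue entries but must accept one read.
    return replace(ulp, ird=max(ulp.ird, 1))
```

The ULP would still see the depths it asked for; only the sizes handed
to the provider grow.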
> 
> > 
> >     In the former, Send is used but this requires a buffer to be
> >     posted to CQ. But since the same CQ (or SharedCQ) can be used by
> >     other connections at the same time, it can cause the responder
> >     CM posted buffer to be consumed by another connection.
> >     This is not acceptable.
> > 
> > 
> >      
> > 
> > 
> >     So now we consider an extension to the MPA protocol.
> > 
> >     The goal is to be completely backwards compatible with existing
> >     version 1. In a nutshell, use a "flag" in the MPA request message
> >     which indicates that a "ready to receive" message will be sent by
> >     the requestor upon receiving an MPA response message with
> >     connection acceptance.
> > 
> >     Here are the changes to IETF RFC 5044:
> > 
> > 
> >      
> > 
> > 
> >     1.
> >          0                   1                   2                   3
> >          0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
> >         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> >      0  |                                                               |
> >         +         Key (16 bytes containing "MPA ID Req Frame")          +
> >      4  |      (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65)        |
> >         +         Or  (16 bytes containing "MPA ID Rep Frame")          +
> >      8  |      (4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65)        |
> >         +         Or  (16 bytes containing "MPA ID Rtr Frame")          +
> >      12 |      (4D 50 41 20 49 44 20 52 74 52 20 46 72 61 6D 65)        |
> >         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> >      16 |M|C|R|S| Res   |     Rev       |          PD_Length            |
> >         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> >         |                                                               |
> >         ~                                                               ~
> >         ~                          Private Data                         ~
> >         |                                                               |
> >         |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> >         |                               |
> >         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> > 
> > 
> >      
> > 
> > 
> >     2. S: indicator in the Req frame of whether or not the Requestor
> >        will send an Rtr frame.
> > 
> >        In the Req frame, 1 means that an Rtr frame will be sent if
> >        the responder sends a Rep frame with the accept bit set;
> >        0 indicates that an Rtr frame will not be sent.
> > 
> >        In the Rep frame, 0 means that the Responder cannot support
> >        the Rtr frame, while 1 means that it can and is waiting for it.
> >        (While my preference is to handle this with MPA protocol
> >        version matching rules, the proposed method will provide
> >        complete backwards compatibility.)
> > 
> >        Unused by the Rtr frame: it is set to 0 in the Rtr frame and
> >        ignored by the responder.
> > 
> >        All other bits (M, C, R and the remainder of Res) are treated
> >        as in MPA ver 1.
> > 
> >        The Rtr frame adheres to the C bit as specified in the Rep frame.
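
The frame layout and S-bit rules above can be sketched in Python as
follows. This is only an illustration under assumptions: the function
names are made up, the masks assume M is the most significant bit of
byte 16 with S taking the first of the formerly reserved bits, and Rev
and PD_Length are packed in network byte order. The Rtr key is not
hard-coded; any 16-byte key can be passed in.

```python
import struct

# Flag masks for byte 16 (assumed bit order: M is the top bit).
M_BIT, C_BIT, R_BIT, S_BIT = 0x80, 0x40, 0x20, 0x10

REQ_KEY = b"MPA ID Req Frame"
REP_KEY = b"MPA ID Rep Frame"


def pack_frame(key, *, m=False, c=False, r=False, s=False,
               rev=1, private_data=b""):
    """Build key + flags + Rev + PD_Length + private data."""
    if len(key) != 16:
        raise ValueError("key must be exactly 16 bytes")
    flags = ((M_BIT if m else 0) | (C_BIT if c else 0) |
             (R_BIT if r else 0) | (S_BIT if s else 0))
    header = struct.pack("!BBH", flags, rev, len(private_data))
    return key + header + private_data


def unpack_frame(frame):
    """Split a frame back into its fields."""
    flags, rev, pd_len = struct.unpack("!BBH", frame[16:20])
    return {
        "key": frame[:16],
        "m": bool(flags & M_BIT),
        "c": bool(flags & C_BIT),
        "r": bool(flags & R_BIT),
        "s": bool(flags & S_BIT),
        "rev": rev,
        "private_data": frame[20:20 + pd_len],
    }


def requestor_sends_rtr(req_s, rep_s, accepted):
    """Rtr is sent only if both sides set S and the Rep accepted."""
    return bool(req_s and rep_s and accepted)
```

A peer that predates the proposal would leave S at 0 in its Rep frame,
so requestor_sends_rtr returns False and the exchange falls back to
plain version 1 behaviour.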
> > 
> > 
> 
> First, the RTR frame _must_ be an FPDU for this to work.
> Thus it violates the DDP/RDMAP specs because it is an unknown
> DDP/RDMAP opcode.
> 
> Second, assuming the RTR frame is sent as an FPDU, then this 
> won't work with existing RNIC HW.  The HW will post an async 
> error because the incoming DDP/RDMAP opcode is unknown.
> 
> The only way I see that we can fix this for the existing rnic 
> HW is to come up with some way to send a valid RDMAP message 
> from the initiator to the responder under the covers -and- 
> have the responder only indicate that the connection is 
> established when that FPDU is received.
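
The responder-side behaviour described here (hold back the established
event until the first valid RDMAP message arrives) might look roughly
like the sketch below. ResponderCm, its states, and the notify_ulp
callback are hypothetical; a real iw_cm would hook this into the RNIC
driver's event path.

```python
from enum import Enum, auto


class State(Enum):
    MPA_REPLY_SENT = auto()  # MPA reply sent, ULP not yet told
    ESTABLISHED = auto()     # first RDMAP message seen, ULP told


class ResponderCm:
    """Defers the established event until the initiator speaks first."""

    def __init__(self, notify_ulp):
        self.state = State.MPA_REPLY_SENT
        self._notify_ulp = notify_ulp

    def on_first_rdmap_message(self):
        # Called when the hidden message (e.g. the 0B read) arrives.
        if self.state is State.MPA_REPLY_SENT:
            self.state = State.ESTABLISHED
            self._notify_ulp("ESTABLISHED")
```

Until on_first_rdmap_message fires, the ULP gets no event and so has
no reason to post sends on the connection.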
> 
> Chelsio cannot support this hack via a 0B write, but they
> could support a 0B read or send/recv exchange.  But as you
> indicate, this is very painful and perhaps impossible to do
> without impacting the ULP and breaking verbs semantics.
> 
> (that's why we punted on this a year ago :)
> 
> 
> Steve.
> 