[ofa-general] Re: iWARP peer-to-peer CM proposal

Wed Nov 28 05:13:04 PST 2007

Any posting to the SQ prior to connection establishment will complete
immediately with "flushed" status.

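A minimal sketch of that rule, as a toy model only. It is not the verbs
API; ToyQP, WcStatus and post_send are hypothetical names used purely
for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class WcStatus(Enum):
    SUCCESS = "success"
    FLUSHED = "flushed"


@dataclass
class ToyQP:
    """Toy queue pair that only models the pre-establishment rule."""
    established: bool = False
    cq: list = field(default_factory=list)  # completions visible to the ULP
    sq: list = field(default_factory=list)  # work requests still pending

    def post_send(self, wr_id):
        if not self.established:
            # Posted before the connection exists: complete it right away
            # with flushed status instead of queueing it.
            self.cq.append((wr_id, WcStatus.FLUSHED))
        else:
            self.sq.append(wr_id)


qp = ToyQP()
qp.post_send(1)          # before establishment: flushed immediately
qp.established = True
qp.post_send(2)          # after establishment: queued on the SQ
```
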
Arkady Kanevsky                       email: arkady at netapp.com
Network Appliance Inc.               phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.        Fax: 781-895-1195
Waltham, MA 02451                   central phone: 781-768-5300

> -----Original Message-----
> From: Glenn Grundstrom [mailto:ggrundstrom at NetEffect.com] 
> Sent: Sunday, November 25, 2007 9:00 PM
> To: Steve Wise; Kanevsky, Arkady
> Cc: Leonid Grossman; openib-general at openib.org
> Subject: RE: [ofa-general] Re: iWARP peer-to-peer CM proposal
> 
> > 
> > Kanevsky, Arkady wrote:
> > > Very good points.
> > > Thanks Steve.
> > >
> > > If we can do unsignalled 0-size RDMA Read with "bogus" S-tag this may
> > > work better.
> > > Yes, it will require that IRD not be set to 0 at the Responder.
> > > Ditto ORD of at least 1 on the Initiator.
> > > There is no need to have extra CQ entry on either side for it.
> > > It is only needed for error path.
> > > So this will only be needed if Sender posted the full queue of sends.
> > > But it cannot post anything because CM will not let it know that
> > > connection is established.
> > >
> > >   
> > Well, actually, I think the ULP _can_ post before establishing the 
> > connection.  But I guess we can define the semantics such that 
> > applications using the rdma-cm interface must adhere to whatever we 
> > need to make this hack work.
> > 
> > Q: are there apps using the rdma-cm out there today that pre-post SQ
> > WRs before getting an ESTABLISHED event?
> > 
> > Steve.
> 
> ULPs are allowed to post prior to establishing the
> connection, but I can't name any that operate this way.
> Prohibiting applications that use the rdma_cm directly from
> pre-posting is okay, but what about ULPs over other ULPs
> (e.g. MPI over uDAPL)?  How can/will this be handled?
> 
> Glenn.
> 
> 
> > > Happy Thanksgiving,
> > >
> > >
> > >   
> > >> -----Original Message-----
> > >> From: Steve Wise [mailto:swise at opengridcomputing.com]
> > >> Sent: Wednesday, November 21, 2007 1:07 PM
> > >> To: Kanevsky, Arkady
> > >> Cc: Glenn Grundstrom; Leonid Grossman; openib-general at openib.org
> > >> Subject: [ofa-general] Re: iWARP peer-to-peer CM proposal
> > >>
> > >> Comments in-line below...
> > >>
> > >>
> > >> Kanevsky, Arkady wrote:
> > >>     
> > >>>     Group,
> > >>>
> > >>>     below is a proposal on how to resolve the peer-to-peer iWARP CM
> > >>>     issue discovered at the interop event.
> > >>>
> > >>>     The main issue is that the MPA spec (the relevant portion of
> > >>>     IETF RFC 5044 is below) requires that the connection initiator
> > >>>     send the first message over the established connection.
> > >>>     Multiple MPI implementations and several other apps use a
> > >>>     peer-to-peer model.
> > >>>
> > >>>     So rather than forcing all of them to do it on their own, which
> > >>>     will not help with interop between different implementations,
> > >>>     the goal is to extend the lower layers to provide it.
> > >>>
> > >>>
> > >>>      
> > >>>
> > >>>
> > >>>     Our first idea was to leave the MPA protocol untouched and try
> > >>>     to solve this problem in iw_cm. But there are too many
> > >>>     complications to it. First, in order to adhere to RFC 5044 the
> > >>>     initiator must send the first FPDU and the responder must
> > >>>     process it. But since the connection is already established,
> > >>>     processing the FPDU involves the ULP on whose behalf the
> > >>>     connection is created.
> > >>>
> > >>>     So either the initiator sends a message which generates a
> > >>>     completion on the responder CQ, thus visible to the ULP, or not.
> > >>>
> > >>>     In the latter case, the only op which can do it is an RDMA one,
> > >>>     which means that the responder somehow provided the initiator
> > >>>     with an S-tag which it can use. So, this is an extension to
> > >>>     MPA, probably using private data. And the responder, upon
> > >>>     receiving it, destroys this S-tag.
> > >>>
> > >>>     In any case this is an extension of MPA.
> > >>>
> > >> This stag exchange isn't needed if this RDMA op is a 0B READ.
> > >> The responder waits for that 0B read and only indicates that the
> > >> rdma connection is established to its ULP when it replies to the
> > >> 0B read.  In this scenario, the responder/server side doesn't
> > >> consume any CQ resources.  But it would require an IRD of at least
> > >> 1 to be configured on the QP.  The initiator still requires an SQ
> > >> entry, and possibly a CQ entry, for initiating the 0B read and
> > >> handling completion.  But it's perhaps a little less painful than
> > >> doing a SEND/RECV exchange.  The read WR could be unsignaled so
> > >> that it won't generate a CQE.  But it still consumes an SQ WR slot,
> > >> so the SQ would have to be sized to allow this extra WR.  And I
> > >> guess the CQ would also need to be sized accordingly in case the
> > >> read failed.
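The sizing consequences of the hidden 0B read can be sketched roughly
as below. This is bookkeeping only, not a real CM or verbs interface:
QpSizes and size_for_hidden_read are hypothetical names, and the ORD
floor of 1 on the initiator is an assumption that follows from the
initiator being the side that issues the read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QpSizes:
    sq_depth: int    # send queue work-request slots
    cq_depth: int    # completion queue entries
    ird_depth: int   # inbound RDMA read depth (IRD)
    ord_depth: int   # outbound RDMA read depth (ORD)


def size_for_hidden_read(ulp: QpSizes, is_initiator: bool) -> QpSizes:
    """Grow the sizes requested by the ULP to leave room for the 0B read."""
    if is_initiator:
        return QpSizes(
            sq_depth=ulp.sq_depth + 1,          # slot for the 0B read WR
            cq_depth=ulp.cq_depth + 1,          # room for a CQE if the read fails
            ird_depth=ulp.ird_depth,
            ord_depth=max(ulp.ord_depth, 1),    # assumed: must issue one read
        )
    # Responder: no SQ or CQ cost, but it must accept one inbound read.
    return QpSizes(
        sq_depth=ulp.sq_depth,
        cq_depth=ulp.cq_depth,
        ird_depth=max(ulp.ird_depth, 1),
        ord_depth=ulp.ord_depth,
    )
```

For example, a ULP asking for QpSizes(16, 32, 0, 0) would end up with
one extra SQ slot, one extra CQ entry and ORD 1 on the initiator, and
only IRD 1 on the responder.
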
> > >>
> > >>>     In the former, Send is used, but this requires a buffer to be
> > >>>     posted to the RQ. But since the same RQ (i.e. a shared RQ) can
> > >>>     be used by other connections at the same time, it can cause the
> > >>>     buffer posted by the responder CM to be consumed by another
> > >>>     connection. This is not acceptable.
> > >>>
> > >>>
> > >>>      
> > >>>
> > >>>
> > >>>     So now we consider an extension to the MPA protocol.
> > >>>
> > >>>     The goal is to be completely backwards compatible with the
> > >>>     existing version 1. In a nutshell, use a "flag" in the MPA
> > >>>     request message which indicates that a "ready to receive"
> > >>>     message will be sent by the requestor upon receiving the MPA
> > >>>     response message with connection acceptance.
> > >>>
> > >>>     Here are the changes to IETF RFC 5044:
> > >>>
> > >>>
> > >>>      
> > >>>
> > >>>
> > >>>     1.
> > >>>          0                   1                   2                   3
> > >>>          0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
> > >>>         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> > >>>     0   |                                                               |
> > >>>         +         Key (16 bytes containing "MPA ID Req Frame")          +
> > >>>     4   |      (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65)        |
> > >>>         +         Or  (16 bytes containing "MPA ID Rep Frame")          +
> > >>>     8   |      (4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65)        |
> > >>>         +         Or  (16 bytes containing "MPA ID Rtr Frame")          +
> > >>>     12  |      (4D 50 41 20 49 44 20 52 74 72 20 46 72 61 6D 65)        |
> > >>>         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> > >>>     16  |M|C|R|S| Res   |     Rev       |          PD_Length            |
> > >>>         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> > >>>         |                                                               |
> > >>>         ~                                                               ~
> > >>>         ~                          Private Data                         ~
> > >>>         |                                                               |
> > >>>         |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> > >>>         |                               |
> > >>>         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
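A rough sketch of how the 20-byte header in the figure could be packed
and parsed, assuming network byte order and M as the most significant
bit of byte 16, as in RFC 5044. The S flag and the Rtr key are the
proposed additions and are not part of RFC 5044; pack_frame and
parse_frame are hypothetical helper names.

```python
import struct

KEY_REQ = b"MPA ID Req Frame"
KEY_REP = b"MPA ID Rep Frame"
KEY_RTR = b"MPA ID Rtr Frame"   # proposed key, not in RFC 5044

# Flag bits in byte 16, most significant bit first.
FLAG_M, FLAG_C, FLAG_R = 0x80, 0x40, 0x20
FLAG_S = 0x10                   # proposed bit, carved out of Res

HDR = struct.Struct("!16sBBH")  # Key, flags+Res, Rev, PD_Length


def pack_frame(key, flags, rev=1, private_data=b""):
    """Build a frame: fixed header followed by the private data."""
    return HDR.pack(key, flags, rev, len(private_data)) + private_data


def parse_frame(buf):
    """Split a frame into its header fields and private data."""
    key, flags, rev, pd_len = HDR.unpack_from(buf)
    private_data = buf[HDR.size:HDR.size + pd_len]
    if len(private_data) != pd_len:
        raise ValueError("truncated private data")
    return {
        "key": key,
        "M": bool(flags & FLAG_M),
        "C": bool(flags & FLAG_C),
        "R": bool(flags & FLAG_R),
        "S": bool(flags & FLAG_S),
        "rev": rev,
        "private_data": private_data,
    }


req = pack_frame(KEY_REQ, FLAG_C | FLAG_S, private_data=b"hi")
parsed = parse_frame(req)
```

Here the Req frame carries C=1 and S=1 plus two bytes of private data,
so the frame is 22 bytes long and byte 16 is 0x50.
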
> > >>>
> > >>>
> > >>>      
> > >>>
> > >>>
> > >>>     2. S: indicator in the Req frame whether or not the Requestor
> > >>>        will send an Rtr frame.
> > >>>
> > >>>        In the Req frame, if set to 1 then an Rtr frame will be sent
> > >>>        if the responder sends a Rep frame with the accept bit set.
> > >>>        0 indicates that an Rtr frame will not be sent.
> > >>>
> > >>>        In the Rep frame, 0 means that the Responder cannot support
> > >>>        the Rtr frame, while 1 means that it can and is waiting for it.
> > >>>
> > >>>        (While my preference is to handle this as MPA protocol
> > >>>        version matching rules, the proposed method will provide
> > >>>        complete backwards compatibility.)
> > >>>
> > >>>        Unused by the Rtr frame. That is, set to 0 in the Rtr frame
> > >>>        and ignored by the responder.
> > >>>
> > >>>        All other bits (M, C, R and the remainder of Res) are treated
> > >>>        as in MPA ver 1.
> > >>>
> > >>>        The Rtr frame adheres to the C bit as specified in the Rep
> > >>>        frame.
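The S-bit rules above could be read as the following decision logic.
This is one possible interpretation, not text from the proposal:
reply_s_bit and rtr_expected are hypothetical names, and answering 0
to a Req frame that has S=0 is an assumption (the responder has
nothing to wait for in that case). It also assumes that a version 1
peer leaves the reserved bits at 0 when sending.

```python
def reply_s_bit(req_s: int, responder_supports_rtr: bool) -> int:
    """S bit the responder puts in its Rep frame."""
    return 1 if (req_s == 1 and responder_supports_rtr) else 0


def rtr_expected(req_s: int, rep_s: int, accepted: bool) -> bool:
    """True when the requestor sends an Rtr frame and the responder waits for it."""
    return accepted and req_s == 1 and rep_s == 1


# New requestor talking to a version 1 responder: S comes back as 0,
# so no Rtr frame is sent and the exchange falls back to plain MPA.
legacy = rtr_expected(req_s=1, rep_s=reply_s_bit(1, False), accepted=True)

# Both sides support the extension and the connection is accepted.
both = rtr_expected(req_s=1, rep_s=reply_s_bit(1, True), accepted=True)
```
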
> > >>>
> > >>>
> > >>>       
> > >> First, the RTR frame _must_ be an FPDU for this to work.
> > >> Thus it violates the DDP/RDMAP specs because it is an unknown
> > >> DDP/RDMAP opcode.
> > >>
> > >> Second, assuming the RTR frame is sent as an FPDU, then this won't
> > >> work with existing RNIC HW.  The HW will post an async error
> > >> because the incoming DDP/RDMAP opcode is unknown.
> > >>
> > >> The only way I see that we can fix this for the existing RNIC HW is
> > >> to come up with some way to send a valid RDMAP message from the
> > >> initiator to the responder under the covers -and- have the
> > >> responder only indicate that the connection is established when
> > >> that FPDU is received.
> > >>
> > >> Chelsio cannot support this hack via a 0B write, but they could
> > >> support a 0B read or send/recv exchange.  But as you indicate, this
> > >> is very painful and perhaps impossible to do without impacting the
> > >> ULP and breaking verbs semantics.
> > >>
> > >> (that's why we punted on this a year ago :)
> > >>
> > >>
> > >> Steve.
> > >>
> > >> _______________________________________________
> > >> general mailing list
> > >> general at lists.openfabrics.org
> > >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > >>
> > >> To unsubscribe, please visit
> > >> http://openib.org/mailman/listinfo/openib-general
> > >>
> > >>     
> > 