[ofa-general] Re: iWARP peer-to-peer CM proposal

Sun Nov 25 17:59:40 PST 2007

> 
> Kanevsky, Arkady wrote:
> > Very good points.
> > Thanks Steve.
> >
> > If we can do an unsignalled 0-size RDMA Read with a "bogus" S-tag,
> > this may work better.
> > Yes, it will require that IRD not be set to 0 at the Responder.
> > Ditto ORD of at least 1 on Responder.
> > There is no need to have an extra CQ entry on either side for it.
> > It is only needed for the error path.
> > So this will only be needed if the Sender posted the full queue of sends.
> > But it cannot post anything because the CM will not let it know that
> > the connection is established.
> >
> >   
> Well, actually, I think the ULP _can_ post before establishing the
> connection.  But I guess we can define the semantics such that
> applications using the rdma-cm interface must adhere to whatever we
> need to make this hack work.
> 
> Q: are there apps using the rdma-cm out there today that pre-post
> SQ WRs before getting an ESTABLISHED event?
> 
> Steve.

ULPs are allowed to post prior to establishing the connection, but I
can't name any that operate this way.  Prohibiting applications
that use the rdma_cm directly from pre-posting is okay, but what
about ULPs layered over other ULPs (e.g. MPI over uDAPL)?  How can/will
this be handled?

Glenn.

> > Happy Thanksgiving,
> >
> > Arkady Kanevsky
> >  
> >
> >   
> >> -----Original Message-----
> >> From: Steve Wise [mailto:swise at opengridcomputing.com] 
> >> Sent: Wednesday, November 21, 2007 1:07 PM
> >> To: Kanevsky, Arkady
> >> Cc: Glenn Grundstrom; Leonid Grossman; openib-general at openib.org
> >> Subject: [ofa-general] Re: iWARP peer-to-peer CM proposal
> >>
> >> Comments in-line below...
> >>
> >>
> >> Kanevsky, Arkady wrote:
> >>     
> >>>     Group,
> >>>
> >>>     Below is a proposal on how to resolve the peer-to-peer iWARP CM
> >>>     issue discovered at the interop event.
> >>>
> >>>     The main issue is that the MPA spec (the relevant portion of
> >>>     IETF RFC 5044 is below) requires that the connection initiator
> >>>     send the first message over the established connection.
> >>>     Multiple MPI implementations and several other apps use a
> >>>     peer-to-peer model. So rather than forcing all of them to do it
> >>>     on their own, which will not help with interop between different
> >>>     implementations, the goal is to extend the lower layers to
> >>>     provide it.
> >>>
> >>>     Our first idea was to leave the MPA protocol untouched and try
> >>>     to solve this problem in iw_cm. But there are too many
> >>>     complications to it. First, in order to adhere to RFC 5044, the
> >>>     initiator must send the first FPDU and the responder must
> >>>     process it. But since the connection is already established,
> >>>     processing the FPDU involves the ULP on whose behalf the
> >>>     connection is created.
> >>>
> >>>     So either the initiator sends a message which generates a
> >>>     completion on the responder CQ, thus visible to the ULP, or not.
> >>>     In the latter case, the only op which can do it is an RDMA one,
> >>>     which means that the responder somehow provided the initiator
> >>>     with an S-tag which it can use. So, this is an extension to MPA,
> >>>     probably using private data, with the responder destroying this
> >>>     S-tag upon receiving the op.
> >>>
> >>>     In any case this is an extension of MPA.
> >>>
> >>>       
> >> This stag exchange isn't needed if this RDMA op is a 0B READ.
> >> The responder waits for that 0B read and only indicates the
> >> rdma connection is established to its ULP when it replies to
> >> the 0B read.  In this scenario, the responder/server side
> >> doesn't consume any CQ resources.
> >> But it would require an IRD of at least 1 to be configured on the QP.
> >> The initiator still requires an SQ entry, and possibly a CQ
> >> entry, for initiating the 0B read and handling completion.
> >> But it's perhaps a little less painful than doing a SEND/RECV
> >> exchange.  The read WR could be unsignaled so that it won't
> >> generate a CQE.  But it still consumes an SQ WR slot, so the
> >> SQ would have to be sized to allow this extra WR. And I guess
> >> the CQ would also need to be sized accordingly in case the
> >> read failed.
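
A rough sketch of the sizing rule described above, in Python for brevity.
The QpSizing type and size_for_hidden_read() are hypothetical names, not
part of any verbs or rdma-cm API. The ORD >= 1 requirement on the
initiator is an assumption added here, since the initiator is the side
that issues the read.

```python
from dataclasses import dataclass


@dataclass
class QpSizing:
    """Queue depths and read depths requested for one QP (hypothetical type)."""
    sq_depth: int
    cq_depth: int
    ird: int
    ord: int


def size_for_hidden_read(ulp: QpSizing, is_initiator: bool) -> QpSizing:
    """Pad the ULP's requested sizes so a CM-issued 0B read fits.

    Initiator: one extra SQ slot for the read WR, and one extra CQ slot
    in case the (unsignaled) read completes in error.
    Responder: no extra SQ/CQ slots, but IRD must be at least 1.
    """
    if is_initiator:
        return QpSizing(ulp.sq_depth + 1, ulp.cq_depth + 1,
                        ulp.ird, max(ulp.ord, 1))
    return QpSizing(ulp.sq_depth, ulp.cq_depth, max(ulp.ird, 1), ulp.ord)
```

For example, a ULP asking for 16 SQ and 16 CQ entries with IRD/ORD of 0
would get 17/17 with ORD 1 on the initiator, and 16/16 with IRD 1 on the
responder.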
> >>
> >>     
> >>>     In the former, Send is used, but this requires a buffer to be
> >>>     posted to the RQ. But since the same RQ (or shared RQ) can be
> >>>     used by other connections at the same time, it can cause the
> >>>     buffer posted by the responder CM to be consumed by another
> >>>     connection. This is not acceptable.
> >>>
> >>>     So now we consider an extension to the MPA protocol.
> >>>
> >>>     The goal is to be completely backwards compatible with the
> >>>     existing version 1. In a nutshell, use a "flag" in the MPA
> >>>     request message which indicates that a "ready to receive"
> >>>     message will be sent by the requestor upon receiving the MPA
> >>>     response message with connection acceptance.
> >>>
> >>>     Here are the changes to IETF RFC 5044:
> >>>
> >>>
> >>>      
> >>>
> >>>
> >>>     1.
> >>>         0                   1                   2                   3
> >>>         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
> >>>        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> >>>      0 |                                                               |
> >>>        +         Key (16 bytes containing "MPA ID Req Frame")          +
> >>>      4 |      (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65)        |
> >>>        +         Or  (16 bytes containing "MPA ID Rep Frame")          +
> >>>      8 |      (4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65)        |
> >>>        +         Or  (16 bytes containing "MPA ID Rtr Frame")          +
> >>>     12 |      (4D 50 41 20 49 44 20 52 74 52 20 46 72 61 6D 65)        |
> >>>        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> >>>     16 |M|C|R|S| Res   |     Rev       |          PD_Length            |
> >>>        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> >>>        |                                                               |
> >>>        ~                                                               ~
> >>>        ~                          Private Data                         ~
> >>>        |                                                               |
> >>>        |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> >>>        |                               |
> >>>        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> >>>
> >>>
> >>>      
> >>>
> >>>
> >>>     2. S: indicator in the Req frame of whether or not the
> >>>     Requestor will send an Rtr frame.
> >>>
> >>>         In the Req frame, if set to 1, the Rtr frame will be sent
> >>>         if the responder sends a Rep frame with the accept bit set.
> >>>         0 indicates that the Rtr frame will not be sent.
> >>>
> >>>         In the Rep frame, 0 means that the Responder cannot support
> >>>         the Rtr frame, while 1 means that it can and is waiting for
> >>>         it.
> >>>
> >>>         (While my preference is to handle this with MPA protocol
> >>>         version matching rules, the proposed method will provide
> >>>         complete backwards compatibility.)
> >>>
> >>>         Unused by the Rtr frame. That is, it is set to 0 in the Rtr
> >>>         frame and ignored by the responder.
> >>>
> >>>         All other bits (M, C, R and the remainder of Res) are
> >>>         treated as in MPA ver 1.
> >>>
> >>>         The Rtr frame adheres to the C bit as specified in the Rep
> >>>         frame.
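
To make the proposed header concrete, here is a minimal sketch in Python
that packs and parses the 20-byte frame header with the extra S bit. It
is an illustration under stated assumptions, not anyone's
implementation: the mask values follow the bit order in the diagram (M
is the most significant bit of byte 16, so S lands on 0x10), "accept" is
taken to mean the RFC 5044 R (reject) bit is clear, and the Rtr key uses
the ASCII string from the diagram (the hex dump in the proposal spells
it 52 74 52, i.e. "RtR"). All function names are hypothetical.

```python
import struct

# 16-byte keys as quoted in the proposal.
KEY_REQ = b"MPA ID Req Frame"
KEY_REP = b"MPA ID Rep Frame"
KEY_RTR = b"MPA ID Rtr Frame"  # assumption: ASCII text, not the hex dump

# Flag masks for byte 16; M is the most significant bit.
FLAG_M = 0x80  # markers
FLAG_C = 0x40  # CRC
FLAG_R = 0x20  # rejected connection (Rep frame)
FLAG_S = 0x10  # proposed: Rtr frame will be sent / is awaited

# Key, flags+Res, Rev, PD_Length, all in network byte order.
_HDR = struct.Struct("!16sBBH")


def pack_frame(key, flags, rev=1, private_data=b""):
    """Build a Req/Rep/Rtr frame: 20-byte header plus private data."""
    if len(key) != 16:
        raise ValueError("MPA key must be exactly 16 bytes")
    return _HDR.pack(key, flags, rev, len(private_data)) + private_data


def parse_frame(frame):
    """Split a frame into its header fields and private data."""
    key, flags, rev, pd_len = _HDR.unpack_from(frame)
    return {
        "key": key,
        "m": bool(flags & FLAG_M),
        "c": bool(flags & FLAG_C),
        "r": bool(flags & FLAG_R),
        "s": bool(flags & FLAG_S),
        "rev": rev,
        "private_data": frame[_HDR.size:_HDR.size + pd_len],
    }


def rtr_expected(req_flags, rep_flags):
    """Rtr is sent only if both sides set S and the Rep was not a reject."""
    accepted = not (rep_flags & FLAG_R)
    return bool(req_flags & FLAG_S) and bool(rep_flags & FLAG_S) and accepted
```

A version 1 responder that does not know about S would leave the bit at
0 in its Rep frame, so rtr_expected() comes out False and the exchange
falls back to today's behaviour, which is the backwards-compatibility
argument made above.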
> >>>
> >>>
> >>>       
> >> First, the RTR frame _must_ be an FPDU for this to work.
> >> Thus it violates the DDP/RDMAP specs because it is an unknown
> >> DDP/RDMAP opcode.
> >>
> >> Second, assuming the RTR frame is sent as an FPDU, then this 
> >> won't work with existing RNIC HW.  The HW will post an async 
> >> error because the incoming DDP/RDMAP opcode is unknown.
> >>
> >> The only way I see that we can fix this for the existing rnic 
> >> HW is to come up with some way to send a valid RDMAP message 
> >> from the initiator to the responder under the covers -and- 
> >> have the responder only indicate that the connection is 
> >> established when that FPDU is received.
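
The responder-side behaviour described above can be modelled in a few
lines of Python. This is a toy state model only: ResponderCM and its
method names are hypothetical, and it assumes that any first inbound
FPDU from the initiator (the hidden 0B read, or the ULP's own first
message) is enough to release the ESTABLISHED event.

```python
class ResponderCM:
    """Toy model: hold back ESTABLISHED until the first FPDU arrives."""

    def __init__(self):
        self.first_fpdu_seen = False
        self.ulp_events = []  # events delivered to the responder's ULP

    def on_mpa_reply_sent(self):
        # The MPA exchange is done, but the ULP is not told yet.
        self.first_fpdu_seen = False

    def on_inbound_fpdu(self, opcode, length):
        """Called for every FPDU received from the initiator."""
        if not self.first_fpdu_seen:
            self.first_fpdu_seen = True
            # Only now may the responder's ULP learn about the connection.
            self.ulp_events.append("ESTABLISHED")
        # An inbound read request produces no CQE on the responder.
        return opcode != "RDMA_READ_REQUEST"
```

The return value says whether the message would be visible to the
responder's ULP through a completion; for the 0B read it is False, which
is the property that makes the read less intrusive than a SEND/RECV
exchange.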
> >>
> >> Chelsio cannot support this hack via a 0B write, but they
> >> could support a 0B read or send/recv exchange.  But as you
> >> indicate, this is very painful and perhaps impossible to do
> >> without impacting the ULP and breaking verbs semantics.
> >>
> >> (that's why we punted on this a year ago :)
> >>
> >>
> >> Steve.
> >>