[ofa-general] Re: iWARP peer-to-peer CM proposal

Fri Nov 23 07:35:37 PST 2007

Kanevsky, Arkady wrote:
> Very good points.
> Thanks Steve.
>
> If we can do unsignalled 0-size RDMA Read with "bogus" S-tag this may
> work better.
> Yes, it will require IRD to be set to a nonzero value at the Responder.
> Ditto ORD of at least 1 on Responder.
> There is no need to have extra CQ entry on either side for it.
> It is only needed for error path.
> So this will only be needed if Sender posted the full queue of sends.
> But it can not post anything because CM will not let it know that
> connection is established.
>
>   
Well, actually, I think the ULP _can_ post before establishing the 
connection.  But I guess we can define the semantics such that 
applications using the rdma-cm interface must adhere to whatever we need 
to make this hack work.

Q: are there apps using the rdma-cm out there today that pre-post SQ WRs 
before getting an ESTABLISHED event?

Steve.
> Happy Thanksgiving,
>
> Arkady Kanevsky                       email: arkady at netapp.com
> Network Appliance Inc.               phone: 781-768-5395
> 1601 Trapelo Rd. - Suite 16.        Fax: 781-895-1195
> Waltham, MA 02451                   central phone: 781-768-5300
>  
>
>   
>> -----Original Message-----
>> From: Steve Wise [mailto:swise at opengridcomputing.com] 
>> Sent: Wednesday, November 21, 2007 1:07 PM
>> To: Kanevsky, Arkady
>> Cc: Glenn Grundstrom; Leonid Grossman; openib-general at openib.org
>> Subject: [ofa-general] Re: iWARP peer-to-peer CM proposal
>>
>> Comments in-line below...
>>
>>
>> Kanevsky, Arkady wrote:
>>     
>>>     Group,
>>>
>>>     below is a proposal on how to resolve the peer-to-peer iWARP CM
>>>     issue discovered at the interop event.
>>>
>>>     The main issue is that the MPA spec (the relevant portion of IETF
>>>     RFC 5044 is below) requires that the connection initiator send the
>>>     first message over the established connection. Multiple MPI
>>>     implementations and several other apps use a peer-to-peer model.
>>>
>>>     So rather than forcing all of them to do it on their own, which
>>>     will not help with interop between different implementations, the
>>>     goal is to extend the lower layers to provide it.
>>>
>>>
>>>      
>>>
>>>
>>>     Our first idea was to leave the MPA protocol untouched and try to
>>>     solve this problem in iw_cm. But there are too many complications
>>>     to it. First, in order to adhere to RFC 5044, the initiator must
>>>     send the first FPDU and the responder must process it. But since
>>>     the connection is already established, processing the FPDU
>>>     involves the ULP on whose behalf the connection is created.
>>>
>>>     So either the initiator sends a message which generates a
>>>     completion on the responder CQ, thus visible to the ULP, or not.
>>>
>>>     In the latter case, the only op which can do it is an RDMA one,
>>>     which means that the responder somehow provided the initiator an
>>>     S-tag which it can use. So, this is an extension to MPA, probably
>>>     using private data. And the responder, upon receiving it, destroys
>>>     this S-tag.
>>>
>>>     In any case this is an extension of MPA.
>>>
>>>       
>> This stag exchange isn't needed if this RDMA op is a 0B READ. 
>>  The responder waits for that 0B read and only indicates the 
>> rdma connection is established to its ULP when it replies to 
>> the 0B read.  In this scenario, the responder/server side 
>> doesn't consume any CQ resources. 
>> But it would require an IRD of at least 1 to be configured on the QP. 
>> The initiator still requires an SQ entry, and possibly a CQ 
>> entry, for initiating the 0B read and handling completion.  
>> But it's perhaps a little less painful than doing a SEND/RECV 
>> exchange.  The read wr could be unsignaled so that it won't 
>> generate a CQE.  But it still consumes an SQ WR slot so the 
>> SQ would have to be sized to allow this extra WR. And I guess 
>> the CQ would also need to be sized accordingly in case the 
>> read failed.
>>
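To make the resource accounting for the 0B read concrete, here is a rough Python model. It is only a sketch under stated assumptions: QpSizes, initiator_sizes, responder_sizes and Responder are names made up for illustration, not verbs or iw_cm interfaces, and the "READ_REQ" opcode string is a placeholder for whatever the provider sees on the wire.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class QpSizes:
    """Hypothetical stand-in for the sizes a ULP asks for; not a verbs struct."""
    sq_depth: int  # send queue work-request slots
    cq_depth: int  # completion queue entries
    ird: int       # inbound RDMA Read depth


def initiator_sizes(ulp: QpSizes) -> QpSizes:
    # One extra SQ slot for the hidden 0B READ, plus one CQ entry in case
    # the unsignaled read fails and produces an error completion.
    return replace(ulp, sq_depth=ulp.sq_depth + 1, cq_depth=ulp.cq_depth + 1)


def responder_sizes(ulp: QpSizes) -> QpSizes:
    # The responder must accept at least one inbound read; it needs no
    # extra CQ entry because replying to a read generates no completion.
    return replace(ulp, ird=max(ulp.ird, 1))


class Responder:
    """Holds back the 'established' indication until the 0B READ arrives."""

    def __init__(self) -> None:
        self.established = False

    def on_fpdu(self, opcode: str, length: int) -> bool:
        # Returns True exactly once: when the first 0B READ request is seen.
        if not self.established and opcode == "READ_REQ" and length == 0:
            self.established = True
            return True
        return False


if __name__ == "__main__":
    ulp = QpSizes(sq_depth=16, cq_depth=32, ird=0)
    print("initiator:", initiator_sizes(ulp))
    print("responder:", responder_sizes(ulp))
    r = Responder()
    print("indicate established:", r.on_fpdu("READ_REQ", 0))
```

The point of the model is that all the extra cost lands on the initiator (one SQ slot, one spare CQ entry), while the responder only needs a nonzero IRD.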
>>     
>>>     In the former, Send is used but this requires a buffer to be
>>>     posted to CQ. But since the same CQ (or SharedCQ) can be used by
>>>     other connections at the same time, it can cause the responder CM
>>>     posted buffer to be consumed by another connection. This is not
>>>     acceptable.
>>>
>>>     So now we consider an extension to the MPA protocol.
>>>
>>>     The goal is to be completely backwards compatible with the
>>>     existing version 1. In a nutshell, use a "flag" in the MPA request
>>>     message which indicates that a "ready to receive" message will be
>>>     sent by the requestor upon receiving the MPA response message with
>>>     connection acceptance.
>>>
>>>     Here are the changes to IETF RFC 5044:
>>>
>>>
>>>      
>>>
>>>
>>>     1.
>>>      0                   1                   2                   3
>>>      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
>>>     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  0
>>>     |                                                               |
>>>     +         Key (16 bytes containing "MPA ID Req Frame")          +  4
>>>     |      (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65)        |
>>>     +         Or  (16 bytes containing "MPA ID Rep Frame")          +  8
>>>     |      (4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65)        |
>>>     +         Or  (16 bytes containing "MPA ID Rtr Frame")          + 12
>>>     |      (4D 50 41 20 49 44 20 52 74 52 20 46 72 61 6D 65)        |
>>>     +                                                               +
>>>     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 16
>>>     |M|C|R|S| Res   |     Rev       |          PD_Length            |
>>>     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>>>     |                                                               |
>>>     ~                                                               ~
>>>     ~                          Private Data                         ~
>>>     |                                                               |
>>>     |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>>>     |                               |
>>>     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
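For anyone who wants to poke at the frame layout above, here is a minimal Python sketch that packs and unpacks the 20-byte header plus private data. The names (pack_mpa_frame, unpack_mpa_frame, FLAG_S, KEY_RTR) are illustrative only. It assumes the flag bits are numbered from the most significant bit of byte 16, so the proposed S bit is taken to be 0x10, and it assumes Rev and PD_Length are in network byte order. Note that the hex dump in the diagram spells the Rtr key bytes as 74 52 ("tR"), while the sketch uses the ASCII label as written.

```python
import struct

# 16-byte keys; Req and Rep are from RFC 5044, Rtr is the proposed addition.
KEY_REQ = b"MPA ID Req Frame"
KEY_REP = b"MPA ID Rep Frame"
KEY_RTR = b"MPA ID Rtr Frame"  # proposed; spelled per the ASCII label

# Flag masks within byte 16 (assumption: bit 0 in the diagram is the MSB).
FLAG_M = 0x80
FLAG_C = 0x40
FLAG_R = 0x20
FLAG_S = 0x10  # proposed "Rtr frame will follow / is awaited" bit

HDR_LEN = 20  # 16-byte key + flags + Rev + 16-bit PD_Length


def pack_mpa_frame(key: bytes, flags: int, rev: int,
                   private_data: bytes = b"") -> bytes:
    if len(key) != 16:
        raise ValueError("key must be exactly 16 bytes")
    if len(private_data) > 0xFFFF:
        raise ValueError("private data does not fit in PD_Length")
    return key + struct.pack("!BBH", flags, rev, len(private_data)) + private_data


def unpack_mpa_frame(frame: bytes):
    if len(frame) < HDR_LEN:
        raise ValueError("truncated MPA frame header")
    key = frame[:16]
    flags, rev, pd_len = struct.unpack("!BBH", frame[16:HDR_LEN])
    if len(frame) < HDR_LEN + pd_len:
        raise ValueError("truncated private data")
    return key, flags, rev, frame[HDR_LEN:HDR_LEN + pd_len]


if __name__ == "__main__":
    req = pack_mpa_frame(KEY_REQ, FLAG_C | FLAG_S, 1, b"hello")
    print(req.hex())
    print(unpack_mpa_frame(req))
```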
>>>
>>>
>>>      
>>>
>>>
>>>     2. S: indicator in the Req frame of whether or not the Requestor
>>>     will send an Rtr frame.
>>>
>>>         In the Req frame, if set to 1 then the Rtr frame will be sent
>>>         if the responder sends a Rep frame with the accept bit set. 0
>>>         indicates that the Rtr frame will not be sent.
>>>
>>>         In the Rep frame, 0 means that the Responder cannot support
>>>         the Rtr frame, while 1 means that it can and is waiting for it.
>>>
>>>         (While my preference is to handle this as MPA protocol version
>>>         matching rules, the proposed method will provide complete
>>>         backwards compatibility.)
>>>
>>>         Unused by the Rtr frame. That is, set to 0 in the Rtr frame
>>>         and ignored by the responder.
>>>
>>>         All other bits M, C, R and the remainder of Res are treated as
>>>         in MPA ver 1.
>>>
>>>         The Rtr frame adheres to the C bit as specified in the Rep
>>>         frame.
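The S-bit rules above boil down to a small truth table. The sketch below is my reading of them, not text from the proposal: the function names are hypothetical, and it assumes that a version 1 responder leaves the reserved bit at 0 in its Rep frame, which is what makes the scheme backwards compatible.

```python
from itertools import product


def rep_s_bit(req_s: bool, responder_supports_rtr: bool) -> bool:
    """S bit the responder places in its Rep frame.

    Assumption: a responder that does not know about the S bit leaves it
    at 0, so the requestor falls back to plain MPA version 1 behaviour.
    """
    return req_s and responder_supports_rtr


def requestor_sends_rtr(req_s: bool, accepted: bool, rep_s: bool) -> bool:
    """True if the requestor should follow the Rep frame with an Rtr frame."""
    # All three must hold: the requestor offered it, the connection was
    # accepted, and the responder said it is waiting for the Rtr frame.
    return req_s and accepted and rep_s


if __name__ == "__main__":
    print("req_s  supports  accepted  -> rep_s  send_rtr")
    for req_s, supports, accepted in product([False, True], repeat=3):
        rep_s = rep_s_bit(req_s, supports)
        rtr = requestor_sends_rtr(req_s, accepted, rep_s)
        print(f"{req_s!s:6} {supports!s:9} {accepted!s:9} -> {rep_s!s:6} {rtr}")
```

Only one of the eight combinations results in an Rtr frame on the wire; every other case behaves like MPA version 1.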
>>>
>>>
>>>       
>> First, the RTR frame _must_ be an FPDU for this to work.  
>> Thus it violates the DDP/RDMAP specs because it is an unknown 
>> DDP/RDMAP opcode.
>>
>> Second, assuming the RTR frame is sent as an FPDU, then this 
>> won't work with existing RNIC HW.  The HW will post an async 
>> error because the incoming DDP/RDMAP opcode is unknown.
>>
>> The only way I see that we can fix this for the existing rnic 
>> HW is to come up with some way to send a valid RDMAP message 
>> from the initiator to the responder under the covers -and- 
>> have the responder only indicate that the connection is 
>> established when that FPDU is received.
>>
>> Chelsio cannot support this hack via a 0B write, but they 
>> could support a 0B read or send/recv exchange.  But as you 
>> indicate, this is very painful and perhaps impossible to do 
>> without impacting the ULP and breaking verbs semantics.
>>
>> (that's why we punted on this a year ago :)
>>
>>
>> Steve.
>>