[ofa-general] Re: iWARP peer-to-peer CM proposal
Steve Wise
swise at opengridcomputing.com
Fri Nov 23 07:35:37 PST 2007
Kanevsky, Arkady wrote:
> Very good points.
> Thanks Steve.
>
> If we can do unsignalled 0-size RDMA Read with "bogus" S-tag this may
> work better.
> Yes, it will require a nonzero IRD to be set at the Responder.
> Ditto ORD of at least 1 on Responder.
> There is no need to have extra CQ entry on either side for it.
> It is only needed for error path.
> So this will only be needed if the Sender has posted a full queue of sends.
> But it cannot post anything, because the CM will not yet have told it that
> the connection is established.
>
>
Well, actually, I think the ULP _can_ post before establishing the
connection. But I guess we can define the semantics such that
applications using the rdma-cm interface must adhere to whatever we need
to make this hack work.
Q: are there apps using the rdma-cm out there today that pre-post SQ WRs
before getting an ESTABLISHED event?
Steve.
> Happy Thanksgiving,
>
> Arkady Kanevsky email: arkady at netapp.com
> Network Appliance Inc. phone: 781-768-5395
> 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195
> Waltham, MA 02451 central phone: 781-768-5300
>
>
>
>> -----Original Message-----
>> From: Steve Wise [mailto:swise at opengridcomputing.com]
>> Sent: Wednesday, November 21, 2007 1:07 PM
>> To: Kanevsky, Arkady
>> Cc: Glenn Grundstrom; Leonid Grossman; openib-general at openib.org
>> Subject: [ofa-general] Re: iWARP peer-to-peer CM proposal
>>
>> Comments in-line below...
>>
>>
>> Kanevsky, Arkady wrote:
>>
>>> Group,
>>>
>>>
>>> below is a proposal on how to resolve the peer-to-peer iWARP CM issue
>>> discovered at the interop event.
>>>
>>>
>>> The main issue is that the MPA spec (the relevant portion of IETF
>>> RFC 5044 is below) requires that the connection initiator send the
>>> first message over the established connection. Multiple MPI
>>> implementations and several other apps use a peer-to-peer model.
>>> So rather than forcing all of them to do it on their own, which will
>>> not help with interop between different implementations, the goal is
>>> to extend lower layers to provide it.
>>>
>>>
>>>
>>>
>>>
>>> Our first idea was to leave the MPA protocol untouched and try to
>>> solve this problem in iw_cm. But there are too many complications to
>>> it. First, in order to adhere to RFC 5044 the initiator must send the
>>> first FPDU and the responder must process it. But since the
>>> connection is already established, processing the FPDU involves the
>>> ULP on whose behalf the connection is created. So either the
>>> initiator sends a message which generates a completion on the
>>> responder CQ, thus visible to the ULP, or not.
>>>
>>> In the latter case, the only op which can do it is an RDMA one,
>>> which means that the responder somehow provided the initiator an
>>> S-tag which it can use. So, this is an extension to MPA, probably
>>> using private data. And the responder, upon receiving it, destroys
>>> this S-tag. In any case this is an extension of MPA.
>>>
>>>
>> This stag exchange isn't needed if this RDMA op is a 0B READ.
>> The responder waits for that 0B read and only indicates the
>> rdma connection is established to its ULP when it replies to
>> the 0B read. In this scenario, the responder/server side
>> doesn't consume any CQ resources.
>> But it would require an IRD of at least 1 to be configured on the QP.
>> The initiator still requires an SQ entry, and possibly a CQ
>> entry, for initiating the 0B read and handling completion.
>> But it's perhaps a little less painful than doing a SEND/RECV
>> exchange. The read wr could be unsignaled so that it won't
>> generate a CQE. But it still consumes an SQ WR slot so the
>> SQ would have to be sized to allow this extra WR. And I guess
>> the CQ would also need to be sized accordingly in case the
>> read failed.
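
To make the resource accounting in the paragraph above concrete, here is a minimal Python sketch. It is only a model of the sizing arithmetic, not code against any verbs API; the class and function names are hypothetical and chosen for illustration. It assumes the initiator hides one unsignaled 0B read from the ULP and the responder only needs to accept it.

```python
from dataclasses import dataclass


@dataclass
class QueueSizes:
    """Hypothetical container for the depths a ULP asks for."""
    sq_depth: int  # send queue work-request slots
    cq_depth: int  # completion queue entries
    ird: int       # inbound RDMA read depth


def initiator_sizes(ulp: QueueSizes) -> QueueSizes:
    # One extra SQ slot for the hidden 0B read, and one extra CQ entry
    # in case the unsignaled read fails and produces an error completion.
    return QueueSizes(ulp.sq_depth + 1, ulp.cq_depth + 1, ulp.ird)


def responder_sizes(ulp: QueueSizes) -> QueueSizes:
    # No SQ or CQ resources are consumed on the responder, but it must
    # be able to accept at least one inbound read.
    return QueueSizes(ulp.sq_depth, ulp.cq_depth, max(ulp.ird, 1))
```

For example, a ULP asking for 16 SQ slots, 32 CQ entries and an IRD of 0 would end up with 17 and 33 on the initiator, and an IRD of 1 on the responder.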
>>
>>
>>> In the former, Send is used, but this requires a buffer to be posted
>>> to CQ. But since the same CQ (or SharedCQ) can be used by other
>>> connections at the same time, it can cause the buffer posted by the
>>> responder CM to be consumed by another connection. This is not
>>> acceptable.
>>>
>>>
>>>
>>>
>>>
>>> So now we consider an extension to the MPA protocol. The goal is to
>>> be completely backwards compatible with the existing version 1.
>>>
>>> In a nutshell, use a "flag" in the MPA request message which
>>> indicates that a "ready to receive" message will be sent by the
>>> requestor upon receiving an MPA response message with connection
>>> acceptance.
>>>
>>> Here are the changes to IETF RFC 5044:
>>>
>>>
>>>
>>>
>>>
>>> 1. Frame format:
>>>
>>>     0                   1                   2                   3
>>>     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
>>>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>>>  0 |                                                               |
>>>    +         Key (16 bytes containing "MPA ID Req Frame")          +
>>>  4 |      (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65)        |
>>>    +         Or  (16 bytes containing "MPA ID Rep Frame")          +
>>>  8 |      (4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65)        |
>>>    +         Or  (16 bytes containing "MPA ID Rtr Frame")          +
>>> 12 |      (4D 50 41 20 49 44 20 52 74 52 20 46 72 61 6D 65)        |
>>>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>>> 16 |M|C|R|S| Res   |     Rev       |          PD_Length            |
>>>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>>>    |                                                               |
>>>    ~                                                               ~
>>>    ~                   Private Data                                ~
>>>    |                                                               |
>>>    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>>>    |                               |
>>>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
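
As an illustration of the frame layout above, the following Python sketch packs and parses the 20-byte header with the proposed S bit. It is a reading of the diagram, not a reference implementation: the function names are hypothetical, and placing S in the bit right after R (mask 0x10) is an assumption taken from the position shown in the diagram. Note also that the hex bytes quoted for the Rtr key (52 74 52) spell "RtR", while the quoted string is "Rtr"; the sketch uses the string form, which is likewise an assumption.

```python
import struct

# Key strings as quoted in the proposal; the "Rtr" key is the proposed addition.
KEYS = {
    "req": b"MPA ID Req Frame",
    "rep": b"MPA ID Rep Frame",
    "rtr": b"MPA ID Rtr Frame",
}

# Flag bits of the byte at offset 16, most significant bit first.
FLAG_M, FLAG_C, FLAG_R, FLAG_S = 0x80, 0x40, 0x20, 0x10  # S position assumed

# 16-byte key, 1 byte of flags/Res, 1 byte Rev, 2 bytes PD_Length (network order).
HEADER = struct.Struct("!16sBBH")


def pack_frame(kind, flags, rev, private_data=b""):
    """Build a Req/Rep/Rtr frame: header followed by private data."""
    return HEADER.pack(KEYS[kind], flags, rev, len(private_data)) + private_data


def parse_frame(frame):
    """Split a frame into its fields; raises ValueError on an unknown key."""
    key, flags, rev, pd_length = HEADER.unpack(frame[:HEADER.size])
    for kind, value in KEYS.items():
        if value == key:
            break
    else:
        raise ValueError("unknown MPA key")
    return {
        "kind": kind,
        "M": bool(flags & FLAG_M),
        "C": bool(flags & FLAG_C),
        "R": bool(flags & FLAG_R),
        "S": bool(flags & FLAG_S),
        "rev": rev,
        "private_data": frame[HEADER.size:HEADER.size + pd_length],
    }
```

A requestor that wants to announce the Rtr frame would call something like `pack_frame("req", FLAG_S, 1, b"...")`; a peer that only knows version 1 would see the S bit as part of the reserved field.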
>>>
>>>
>>>
>>>
>>>
>>> 2. S: indicator in the Req frame of whether or not the Requestor
>>>    will send an Rtr frame.
>>>
>>>    In the Req frame, if set to 1 then the Rtr frame will be sent if
>>>    the responder sends a Rep frame with the accept bit set. 0
>>>    indicates that the Rtr frame will not be sent.
>>>
>>>    In the Rep frame, 0 means that the Responder cannot support the
>>>    Rtr frame, while 1 means that it can and is waiting for it.
>>>
>>>    (While my preference is to handle this as MPA protocol version
>>>    matching rules, the proposed method will provide complete
>>>    backwards compatibility.)
>>>
>>>    Unused by the Rtr frame. That is, set to 0 in the Rtr frame and
>>>    ignored by the responder.
>>>
>>>    All other bits M, C, R and the remainder of Res are treated as in
>>>    MPA ver 1.
>>>
>>>    The Rtr frame adheres to the C bit as specified in the Rep frame.
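
The S-bit rules above can be sketched as two small decision functions. This is one possible reading of the proposal, not a definitive implementation: the function and parameter names are hypothetical, and the case where the Req frame carries S=0 (the responder then does not wait) is an assumption the proposal implies but does not spell out.

```python
def responder_rep_s(req_s: int, responder_supports_rtr: bool) -> int:
    """S bit the responder puts in its Rep frame.

    1 means "I support the Rtr frame and am waiting for it"; a responder
    that only knows MPA version 1 leaves the bit at 0.
    """
    return 1 if (req_s == 1 and responder_supports_rtr) else 0


def requestor_sends_rtr(req_s: int, rep_accepted: bool, rep_s: int) -> bool:
    """Whether the requestor follows the Rep frame with an Rtr frame.

    The Rtr frame is sent only when the requestor announced it (S=1 in
    Req), the connection was accepted, and the responder confirmed it is
    waiting (S=1 in Rep).
    """
    return req_s == 1 and rep_accepted and rep_s == 1
```

With an old responder the Rep frame comes back with S=0, so `requestor_sends_rtr(1, True, 0)` is false and both sides fall back to version 1 behaviour.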
>>>
>>>
>>>
>> First, the RTR frame _must_ be an FPDU for this to work.
>> Thus it violates the DDP/RDMAP specs because it is an unknown
>> DDP/RDMAP opcode.
>>
>> Second, assuming the RTR frame is sent as an FPDU, then this
>> won't work with existing RNIC HW. The HW will post an async
>> error because the incoming DDP/RDMAP opcode is unknown.
>>
>> The only way I see that we can fix this for the existing rnic
>> HW is to come up with some way to send a valid RDMAP message
>> from the initiator to the responder under the covers -and-
>> have the responder only indicate that the connection is
>> established when that FPDU is received.
>>
>> Chelsio cannot support this hack via a 0B write, but they
>> could support a 0B read or send/recv exchange. But as you
>> indicate, this is very painful and perhaps impossible to do
>> without impacting the ULP and breaking verbs semantics.
>>
>> (that's why we punted on this a year ago :)
>>
>>
>> Steve.
>>