[ofa-general] Re: iWARP peer-to-peer CM proposal
Kanevsky, Arkady
Arkady.Kanevsky at netapp.com
Tue Nov 27 06:54:05 PST 2007
A ULP can post receives before the connection is established, but it
cannot post to the send queue prior to connection establishment.
Arkady Kanevsky email: arkady at netapp.com
Network Appliance Inc. phone: 781-768-5395
1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195
Waltham, MA 02451 central phone: 781-768-5300
> -----Original Message-----
> From: Glenn Grundstrom [mailto:ggrundstrom at NetEffect.com]
> Sent: Sunday, November 25, 2007 9:00 PM
> To: Steve Wise; Kanevsky, Arkady
> Cc: Leonid Grossman; openib-general at openib.org
> Subject: RE: [ofa-general] Re: iWARP peer-to-peer CM proposal
>
> >
> > Kanevsky, Arkady wrote:
> > > Very good points.
> > > Thanks Steve.
> > >
> > > If we can do an unsignalled 0-size RDMA Read with a "bogus" S-tag,
> > > this may work better.
> > > Yes, it will require that IRD not be set to 0 at the Responder.
> > > Ditto ORD of at least 1 on Responder.
> > > There is no need to have an extra CQ entry on either side for it.
> > > It is only needed for the error path.
> > > So this will only be needed if the Sender posted a full queue of
> > > sends. But it cannot post anything because the CM will not let it
> > > know that the connection is established.
> > >
> > >
> > Well, actually, I think the ULP _can_ post before establishing the
> > connection. But I guess we can define the semantics such that
> > applications using the rdma-cm interface must adhere to whatever we
> > need to make this hack work.
> >
> > Q: are there apps using the rdma-cm out there today that pre-post SQ
> > WRs before getting an ESTABLISHED event?
> >
> > Steve.
>
> ULPs are allowed to post prior to establishing the
> connection, but I can't name any that operate this way.
> Prohibiting applications that use the rdma_cm directly from
> pre-posting is okay, but what about ULPs layered over other ULPs
> (e.g. MPI over uDAPL)? How can/will this be handled?
>
> Glenn.
>
>
> > > Happy Thanksgiving,
> > >
> > > Arkady Kanevsky email: arkady at netapp.com
> > > Network Appliance Inc. phone: 781-768-5395
> > > 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195
> > > Waltham, MA 02451 central phone: 781-768-5300
> > >
> > >
> > >
> > >> -----Original Message-----
> > >> From: Steve Wise [mailto:swise at opengridcomputing.com]
> > >> Sent: Wednesday, November 21, 2007 1:07 PM
> > >> To: Kanevsky, Arkady
> > >> Cc: Glenn Grundstrom; Leonid Grossman; openib-general at openib.org
> > >> Subject: [ofa-general] Re: iWARP peer-to-peer CM proposal
> > >>
> > >> Comments in-line below...
> > >>
> > >>
> > >> Kanevsky, Arkady wrote:
> > >>
> > >>> Group,
> > >>>
> > >>> below is a proposal on how to resolve the peer-to-peer iWARP CM
> > >>> issue discovered at the interop event.
> > >>>
> > >>> The main issue is that the MPA spec (relevant portion of IETF
> > >>> RFC 5044 is below) requires that the connection initiator send the
> > >>> first message over the established connection. Multiple MPI
> > >>> implementations and several other apps use a peer-to-peer model.
> > >>> So rather than forcing all of them to do it on their own, which
> > >>> will not help with interop between different implementations, the
> > >>> goal is to extend the lower layers to provide it.
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> Our first idea was to leave the MPA protocol untouched and try to
> > >>> solve this problem in iw_cm. But there are too many complications
> > >>> to it. First, in order to adhere to RFC 5044, the initiator must
> > >>> send the first FPDU and the responder must process it. But since
> > >>> the connection is already established, processing the FPDU
> > >>> involves the ULP on whose behalf the connection is created.
> > >>> So either the initiator sends a message which generates a
> > >>> completion on the responder CQ, thus visible to the ULP, or not.
> > >>>
> > >>> In the latter case, the only op which can do it is an RDMA one,
> > >>> which means that the responder somehow provided the initiator with
> > >>> an S-tag which it can use. So, this is an extension to MPA,
> > >>> probably using private data, and the responder, upon receiving the
> > >>> op, destroys this S-tag.
> > >>> In any case this is an extension of MPA.
> > >>>
> > >>>
> > >> This stag exchange isn't needed if this RDMA op is a 0B READ.
> > >> The responder waits for that 0B read and only indicates the rdma
> > >> connection is established to its ULP when it replies to the 0B
> > >> read. In this scenario, the responder/server side doesn't consume
> > >> any CQ resources.
> > >> But it would require an IRD of at least 1 to be configured on the QP.
> > >> The initiator still requires an SQ entry, and possibly a CQ entry,
> > >> for initiating the 0B read and handling completion.
> > >> But it's perhaps a little less painful than doing a SEND/RECV
> > >> exchange. The read WR could be unsignaled so that it won't
> > >> generate a CQE. But it still consumes an SQ WR slot, so the SQ
> > >> would have to be sized to allow this extra WR. And I guess the CQ
> > >> would also need to be sized accordingly in case the read failed.
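
To make the resource cost of the hidden 0B read concrete, here is a minimal
sketch of the padding a CM would have to apply to whatever the ULP asked for.
QpCaps, pad_initiator and pad_responder are hypothetical names for
illustration only and are not part of any verbs or rdma-cm API. The ORD of at
least 1 on the initiator is an assumption taken from the earlier discussion
in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QpCaps:
    """Hypothetical holder for the limits a ULP requests (not a verbs struct)."""
    max_send_wr: int  # send queue depth
    cq_entries: int   # completion queue depth
    ird: int          # inbound RDMA read depth
    ord: int          # outbound RDMA read depth


def pad_initiator(ulp: QpCaps) -> QpCaps:
    """Reserve room for one hidden zero-byte read on the initiator."""
    return QpCaps(
        max_send_wr=ulp.max_send_wr + 1,  # the read always takes an SQ slot
        cq_entries=ulp.cq_entries + 1,    # used only if the read fails
        ird=ulp.ird,
        ord=max(ulp.ord, 1),              # assumed: must allow one outbound read
    )


def pad_responder(ulp: QpCaps) -> QpCaps:
    """The responder must accept one inbound read; its CQ is untouched."""
    return QpCaps(ulp.max_send_wr, ulp.cq_entries, max(ulp.ird, 1), ulp.ord)
```

A ULP that asked for 16 send WRs and 16 CQEs with no read depth would end up
with 17 of each and an ORD of 1 on the initiator, while the responder only
has its IRD raised to 1.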
> > >>
> > >>
> > >>> In the former, Send is used but this requires a buffer to be
> > >>> posted to CQ. But since the same CQ (or SharedCQ) can be used by
> > >>> other connections at the same time, it can cause the responder CM
> > >>> posted buffer to be consumed by another connection.
> > >>> This is not acceptable.
> > >>>
> > >>> So now we consider an extension to the MPA protocol.
> > >>> The goal is to be completely backwards compatible with existing
> > >>> version 1.
> > >>> In a nutshell, use a "flag" in the MPA request message which
> > >>> indicates that a "ready to receive" message will be sent by the
> > >>> requestor upon receiving the MPA response message with connection
> > >>> acceptance.
> > >>>
> > >>> Here are the changes to IETF RFC 5044:
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> 1.
> > >>>     0                   1                   2                   3
> > >>>     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
> > >>>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> > >>>  0 |                                                               |
> > >>>    +         Key (16 bytes containing "MPA ID Req Frame")          +
> > >>>  4 |      (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65)        |
> > >>>    +         Or  (16 bytes containing "MPA ID Rep Frame")          +
> > >>>  8 |      (4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65)        |
> > >>>    +         Or  (16 bytes containing "MPA ID Rtr Frame")          +
> > >>> 12 |      (4D 50 41 20 49 44 20 52 74 52 20 46 72 61 6D 65)        |
> > >>>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> > >>> 16 |M|C|R|S| Res   |     Rev       |          PD_Length            |
> > >>>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> > >>>    |                                                               |
> > >>>    ~                                                               ~
> > >>>    ~                   Private Data                                ~
> > >>>    |                                                               |
> > >>>    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> > >>>    |                               |
> > >>>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> 2. S: indicator in the Req frame of whether or not the Requestor
> > >>>    will send an Rtr frame.
> > >>>
> > >>>    In the Req frame, if set to 1 then an Rtr frame will be sent if
> > >>>    the responder sends a Rep frame with the accept bit set. 0
> > >>>    indicates that an Rtr frame will not be sent.
> > >>>
> > >>>    In the Rep frame, 0 means that the Responder cannot support the
> > >>>    Rtr frame, while 1 means that it can and is waiting for it.
> > >>>    (While my preference is to handle this with MPA protocol
> > >>>    version matching rules, the proposed method will provide
> > >>>    complete backwards compatibility.)
> > >>>
> > >>>    Unused by the Rtr frame. That is, it is set to 0 in the Rtr
> > >>>    frame and ignored by the responder.
> > >>>
> > >>>    All other bits M, C, R and the remainder of Res are treated as
> > >>>    in MPA ver 1.
> > >>>
> > >>>    The Rtr frame adheres to the C bit as specified in the Rep frame.
> > >>>
> > >>>
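
The proposed frame header and S-bit rule can be sketched as follows. This is
an illustration under stated assumptions, not a reference implementation:
the function names are hypothetical, the flag byte at offset 16 is assumed to
carry M, C, R, S from the most significant bit down (as drawn in the
diagram), and "accept" is read as the R (reject) bit being clear. The Rtr key
uses the quoted string "MPA ID Rtr Frame"; the hex dump in the diagram
differs from that string in one byte, so the exact key bytes should be
treated as unsettled.

```python
import struct

KEY_REQ = b"MPA ID Req Frame"
KEY_REP = b"MPA ID Rep Frame"
KEY_RTR = b"MPA ID Rtr Frame"  # proposed key; exact bytes are an assumption

FLAG_M = 0x80  # markers
FLAG_C = 0x40  # CRC
FLAG_R = 0x20  # reject
FLAG_S = 0x10  # proposed Rtr indicator, carved out of the old Res field


def pack_frame(key, flags, rev=1, private_data=b""):
    """Build key + flags + revision + PD_Length + private data."""
    if len(key) != 16:
        raise ValueError("key must be 16 bytes")
    if flags & 0x0F:
        raise ValueError("remaining Res bits must be zero")
    header = struct.pack("!BBH", flags, rev, len(private_data))
    return key + header + private_data


def unpack_frame(frame):
    """Split a frame into (key, flags, rev, private_data)."""
    if len(frame) < 20:
        raise ValueError("frame shorter than the fixed header")
    key = frame[:16]
    flags, rev, pd_len = struct.unpack("!BBH", frame[16:20])
    private_data = frame[20:20 + pd_len]
    if len(private_data) != pd_len:
        raise ValueError("truncated private data")
    return key, flags, rev, private_data


def reply_flags(req_flags, responder_supports_rtr, reject=False):
    """Flags for the Rep frame; S is echoed only by a capable responder."""
    if reject:
        return FLAG_R
    if (req_flags & FLAG_S) and responder_supports_rtr:
        return FLAG_S
    return 0


def initiator_sends_rtr(req_flags, rep_flags):
    """Rtr goes out only if both sides set S and the Rep was an accept."""
    accepted = not (rep_flags & FLAG_R)
    return bool(accepted and (req_flags & FLAG_S) and (rep_flags & FLAG_S))
```

A version 1 responder that does not know about S would reply with the bit
clear, so initiator_sends_rtr returns False and the exchange falls back to
the existing behaviour.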
> > >>>
> > >> First, the RTR frame _must_ be an FPDU for this to work.
> > >> Thus it violates the DDP/RDMAP specs because it is an unknown
> > >> DDP/RDMAP opcode.
> > >>
> > >> Second, assuming the RTR frame is sent as an FPDU, then this won't
> > >> work with existing RNIC HW. The HW will post an async error
> > >> because the incoming DDP/RDMAP opcode is unknown.
> > >>
> > >> The only way I see that we can fix this for the existing RNIC HW is
> > >> to come up with some way to send a valid RDMAP message from the
> > >> initiator to the responder under the covers -and- have the
> > >> responder only indicate that the connection is established when
> > >> that FPDU is received.
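
The responder-side gating described above can be modelled as a tiny state
machine. This is only a sketch: ResponderCm, its state names and on_fpdu are
hypothetical, and a real iw_cm or RNIC driver would look nothing like it. It
shows just the ordering rule, namely that the ULP sees the ESTABLISHED event
only once the first valid RDMAP message from the initiator has arrived.

```python
class ResponderCm:
    """Toy model of a responder CM that delays the ESTABLISHED event."""

    def __init__(self):
        # The MPA reply has gone out, but the ULP has not been told yet.
        self.state = "MPA_REPLY_SENT"
        self.ulp_events = []

    def on_fpdu(self, opcode):
        """Handle an incoming RDMAP message.

        Returns True if the message is ordinary traffic, False if it was
        the hidden handshake message consumed under the covers.
        """
        if self.state == "MPA_REPLY_SENT":
            self.state = "ESTABLISHED"
            self.ulp_events.append("ESTABLISHED")
            return False
        return True
```

Feeding it a zero-byte read first and a send second yields exactly one
ESTABLISHED event, and only the send is treated as ULP traffic.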
> > >>
> > >> Chelsio cannot support this hack via a 0B write, but they could
> > >> support a 0B read or send/recv exchange. But as you indicate, this
> > >> is very painful and perhaps impossible to do without impacting the
> > >> ULP and breaking verbs semantics.
> > >>
> > >> (that's why we punted on this a year ago :)
> > >>
> > >>
> > >> Steve.
> > >>
> > >> _______________________________________________
> > >> general mailing list
> > >> general at lists.openfabrics.org
> > >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > >>
> > >> To unsubscribe, please visit
> > >> http://openib.org/mailman/listinfo/openib-general
> > >>
> > >>
> >
>