[ofa-general] Re: iWARP peer-to-peer CM proposal
Glenn Grundstrom
ggrundstrom at NetEffect.com
Sun Nov 25 17:59:40 PST 2007
>
> Kanevsky, Arkady wrote:
> > Very good points.
> > Thanks Steve.
> >
> > If we can do an unsignalled 0-size RDMA Read with a "bogus" S-tag this
> > may work better.
> > Yes, it will require that IRD not be set to 0 at the Responder.
> > Ditto ORD of at least 1 on Responder.
> > There is no need to have an extra CQ entry on either side for it.
> > It is only needed for the error path.
> > So this will only be needed if the Sender posted a full queue of sends.
> > But it cannot post anything because the CM will not let it know that
> > the connection is established.
> >
> >
> Well, actually, I think the ULP _can_ post before establishing the
> connection. But I guess we can define the semantics such that
> applications using the rdma-cm interface must adhere to whatever we
> need to make this hack work.
>
> Q: are there apps using the rdma-cm out there today that pre-post SQ WRs
> before getting an ESTABLISHED event?
>
> Steve.
ULPs are allowed to post prior to establishing the connection, but I
can't name any that operate this way. Prohibiting applications
that use the rdma_cm directly from pre-posting is okay, but what
about ULPs layered over other ULPs (e.g. MPI over uDAPL)? How can/will
this be handled?
Glenn.
> > Happy Thanksgiving,
> >
> > Arkady Kanevsky email: arkady at netapp.com
> > Network Appliance Inc. phone: 781-768-5395
> > 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195
> > Waltham, MA 02451 central phone: 781-768-5300
> >
> >
> >
> >> -----Original Message-----
> >> From: Steve Wise [mailto:swise at opengridcomputing.com]
> >> Sent: Wednesday, November 21, 2007 1:07 PM
> >> To: Kanevsky, Arkady
> >> Cc: Glenn Grundstrom; Leonid Grossman; openib-general at openib.org
> >> Subject: [ofa-general] Re: iWARP peer-to-peer CM proposal
> >>
> >> Comments in-line below...
> >>
> >>
> >> Kanevsky, Arkady wrote:
> >>
> >>> Group,
> >>>
> >>>
> >>> Below is a proposal on how to resolve the peer-to-peer iWARP CM issue
> >>> discovered at the interop event.
> >>>
> >>>
> >>> The main issue is that the MPA spec (the relevant portion of IETF
> >>> RFC 5044 is below) requires that the connection initiator send the
> >>> first message over the established connection. Multiple MPI
> >>> implementations and several other apps use a peer-to-peer model.
> >>>
> >>>
> >>> So rather than forcing all of them to do it on their own, which will
> >>> not help with interop between different implementations, the goal is
> >>> to extend lower layers to provide it.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> Our first idea was to leave the MPA protocol untouched and try to
> >>> solve this problem in iw_cm. But there are too many complications to
> >>> it. First, in order to adhere to RFC 5044 the initiator must send the
> >>> first FPDU and the responder must process it. But since the connection
> >>> is already established, processing the FPDU involves the ULP on whose
> >>> behalf the connection is created.
> >>>
> >>>
> >>> So either the initiator sends a message which generates a completion
> >>> on the responder CQ, thus visible to the ULP, or not.
> >>>
> >>
> >>
> >>> In the latter case, the only op which can do it is an RDMA one, which
> >>> means that the responder somehow provided the initiator an S-tag which
> >>> it can use. So, this is an extension to MPA, probably using private
> >>> data. And the responder, upon receiving it, destroys this S-tag.
> >>>
> >>> In any case this is an extension of MPA.
> >>>
> >>>
> >> This stag exchange isn't needed if this RDMA op is a 0B READ.
> >> The responder waits for that 0B read and only indicates the
> >> rdma connection is established to its ULP when it replies to
> >> the 0B read. In this scenario, the responder/server side
> >> doesn't consume any CQ resources.
> >> But it would require an IRD of at least 1 to be configured on the QP.
> >> The initiator still requires an SQ entry, and possibly a CQ
> >> entry, for initiating the 0B read and handling completion.
> >> But it's perhaps a little less painful than doing a SEND/RECV
> >> exchange. The read wr could be unsignaled so that it won't
> >> generate a CQE. But it still consumes an SQ WR slot so the
> >> SQ would have to be sized to allow this extra WR. And I guess
> >> the CQ would also need to be sized accordingly in case the
> >> read failed.
> >>
> >>
> >>> In the former, Send is used but this requires a buffer to be posted
> >>> to CQ. But since the same CQ (or SharedCQ) can be used by other
> >>> connections at the same time, it can cause the buffer posted by the
> >>> responder CM to be consumed by another connection. This is not
> >>> acceptable.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> So now we consider an extension to the MPA protocol.
> >>>
> >>> The goal is to be completely backwards compatible with existing
> >>> version 1. In a nutshell, use a "flag" in the MPA request message
> >>> which indicates that a "ready to receive" message will be sent by the
> >>> requestor upon receiving the MPA response message with connection
> >>> acceptance.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> Here are the changes to IETF RFC 5044:
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> 1.
> >>>     0                   1                   2                   3
> >>>     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
> >>>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> >>>  0 |                                                               |
> >>>    +         Key (16 bytes containing "MPA ID Req Frame")          +
> >>>  4 |      (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65)        |
> >>>    +         Or  (16 bytes containing "MPA ID Rep Frame")          +
> >>>  8 |      (4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65)        |
> >>>    +         Or  (16 bytes containing "MPA ID Rtr Frame")          +
> >>> 12 |      (4D 50 41 20 49 44 20 52 74 52 20 46 72 61 6D 65)        |
> >>>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> >>> 16 |M|C|R|S| Res   |     Rev       |          PD_Length            |
> >>>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> >>>    |                                                               |
> >>>    ~                                                               ~
> >>>    ~                   Private Data                                ~
> >>>    |                                                               |
> >>>    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> >>>    |                               |
> >>>    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> 2. S: indicator in the Req frame whether or not the Requestor will
> >>>    send an Rtr frame.
> >>>
> >>>    In the Req frame, if set to 1 then an Rtr frame will be sent if the
> >>>    responder sends a Rep frame with the accept bit set. 0 indicates
> >>>    that an Rtr frame will not be sent.
> >>>
> >>>    In the Rep frame, 0 means that the Responder cannot support the Rtr
> >>>    frame, while 1 means that it can and is waiting for it.
> >>>
> >>>    (While my preference is to handle this as MPA protocol version
> >>>    matching rules, the proposed method will provide complete backwards
> >>>    compatibility.)
> >>>
> >>>    Unused by the Rtr frame. That is, set to 0 in the Rtr frame and
> >>>    ignored by the responder.
> >>>
> >>>    All other bits M, C, R and the remainder of Res are treated as in
> >>>    MPA ver 1.
> >>>
> >>>    The Rtr frame adheres to the C bit as specified in the Rep frame.
> >>>
> >>>
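For readers following the frame layout quoted above, here is a minimal sketch of how the proposed Req/Rep header could be packed and parsed. It is only an illustration under stated assumptions, not code from any iWARP stack: the helper names (pack_frame, unpack_frame, rtr_expected) are hypothetical, the S bit position (0x10) is read off the quoted diagram, and the rule that an Rtr frame is expected only when both sides set S and the Rep is not a reject is one reading of the text above.

```python
import struct

# Flag bits in the byte that follows the 16-byte key.
# M, C and R are the RFC 5044 bits; S is the bit proposed in this thread.
MPA_FLAG_M = 0x80
MPA_FLAG_C = 0x40
MPA_FLAG_R = 0x20  # in a Rep frame, 1 means the connection was rejected
MPA_FLAG_S = 0x10  # ASSUMPTION: position taken from the quoted diagram
MPA_RES_MASK = 0x0F  # remaining reserved bits

KEY_REQ = b"MPA ID Req Frame"
KEY_REP = b"MPA ID Rep Frame"

# Network byte order: 16-byte key, flags byte, Rev byte, 16-bit PD_Length.
HEADER = struct.Struct("!16sBBH")


def pack_frame(key, flags, rev=1, private_data=b""):
    """Build a Req or Rep frame (hypothetical helper)."""
    if len(key) != 16:
        raise ValueError("key must be exactly 16 bytes")
    if flags & MPA_RES_MASK:
        raise ValueError("reserved bits must be zero")
    return HEADER.pack(key, flags, rev, len(private_data)) + private_data


def unpack_frame(frame):
    """Split a frame into its header fields and private data."""
    key, flags, rev, pd_len = HEADER.unpack_from(frame)
    private_data = frame[HEADER.size:HEADER.size + pd_len]
    if len(private_data) != pd_len:
        raise ValueError("frame shorter than PD_Length says")
    return {
        "key": key,
        "m": bool(flags & MPA_FLAG_M),
        "c": bool(flags & MPA_FLAG_C),
        "r": bool(flags & MPA_FLAG_R),
        "s": bool(flags & MPA_FLAG_S),
        "rev": rev,
        "private_data": private_data,
    }


def rtr_expected(req_flags, rep_flags):
    """ASSUMED rule: Rtr follows only if both sides set S and Rep accepts."""
    accepted = not (rep_flags & MPA_FLAG_R)
    return bool(req_flags & MPA_FLAG_S) and bool(rep_flags & MPA_FLAG_S) and accepted


if __name__ == "__main__":
    req = pack_frame(KEY_REQ, MPA_FLAG_C | MPA_FLAG_S, private_data=b"hello")
    print(unpack_frame(req))
    # A version 1 responder never sets S, so no Rtr frame would be expected.
    print(rtr_expected(MPA_FLAG_S, 0))
```

Because an existing responder leaves the old reserved bits at zero, the check in rtr_expected falls back to today's behaviour when either side does not set S, which is the backwards compatibility the proposal is after.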
> >>>
> >> First, the RTR frame _must_ be an FPDU for this to work.
> >> Thus it violates the DDP/RDMAP specs because it is an unknown
> >> DDP/RDMAP opcode.
> >>
> >> Second, assuming the RTR frame is sent as an FPDU, then this
> >> won't work with existing RNIC HW. The HW will post an async
> >> error because the incoming DDP/RDMAP opcode is unknown.
> >>
> >> The only way I see that we can fix this for the existing rnic
> >> HW is to come up with some way to send a valid RDMAP message
> >> from the initiator to the responder under the covers -and-
> >> have the responder only indicate that the connection is
> >> established when that FPDU is received.
> >>
> >> Chelsio cannot support this hack via a 0B write, but they
> >> could support a 0B read or send/recv exchange. But as you
> >> indicate, this is very painful and perhaps impossible to do
> >> without impacting the ULP and breaking verbs semantics.
> >>
> >> (that's why we punted on this a year ago :)
> >>
> >>
> >> Steve.
> >>
> >> _______________________________________________
> >> general mailing list
> >> general at lists.openfabrics.org
> >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >>
> >> To unsubscribe, please visit
> >> http://openib.org/mailman/listinfo/openib-general
> >>
> >>
>