[ofa-general] Re: iWARP peer-to-peer CM proposal

Tue Nov 27 15:58:33 PST 2007

Caitlin Bestler wrote:
> On Nov 27, 2007 3:13 PM, Steve Wise <swise at opengridcomputing.com> wrote:
>> Caitlin Bestler wrote:
>>> On Nov 27, 2007 6:54 AM, Kanevsky, Arkady <Arkady.Kanevsky at netapp.com> wrote:
>>>> ULP can post recvs before connection is established but not to send
>>>> queue prior to connection establishment.
>>>>
>>>
>>> ULP can post sends only after it is notified that the connection is established.
>>>
>>> The issue is when the iWARP layer can issue this notification.
>>>
>>> If the MPA layer implements fencing on its own, then the notification can
>>> be provided immediately after the MPA Request/Response exchange.
>>>
>>> If not, it must wait for the first MPA frame. The problem is that
>>> implementations that adhere too closely to the RDMAC verbs can obtain
>>> no information about the connection unless there is a CQE producing event.
>> The idea for this "hack" is that the passive side (the side that sends
>> the MPA response) will hold off posting the ESTABLISHED event to the
>> rdma-cm ULP until after it receives this 0B Read Request from the client...
>>
> 
> The problem is that this solution is being applied at the wrong layer.
> 
> MPA is not the source of the problem, but rather the RDMAC layer verbs.
> The solution needs to be a verb-layer solution, not an MPA layer solution.
> 

This isn't being solved at the MPA layer.  It is being solved as a protocol
exchange done after the MPA exchanges (and after the connections are
transitioned into FPDU mode).  Remember: This is a _hack_ to get our
current generation of rnics to support peer-to-peer _without_ impacting
the rdma applications (like IMPI and OMPI).
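
To make the ordering concrete, here is a minimal sketch of the passive side of
the hack, assuming a toy model rather than real driver code.  The class and
method names are hypothetical; the only behavior taken from the discussion is
that ESTABLISHED is held back until the 0B Read Request arrives.

```python
from enum import Enum, auto


class PassiveState(Enum):
    MPA_RESPONSE_SENT = auto()   # MPA response is out, QP is in FPDU mode
    ESTABLISHED = auto()         # ESTABLISHED has been delivered to the ULP


class PassiveConnection:
    """Hypothetical model of the passive side; not an actual rdma-cm API."""

    def __init__(self):
        self.state = PassiveState.MPA_RESPONSE_SENT
        self.ulp_events = []     # events delivered to the rdma-cm ULP so far

    def on_zero_byte_read_request(self):
        # The 0B Read Request shows the active side received the MPA
        # response, so the ULP may now be told it can post sends.
        if self.state is PassiveState.MPA_RESPONSE_SENT:
            self.state = PassiveState.ESTABLISHED
            self.ulp_events.append("ESTABLISHED")


conn = PassiveConnection()
conn.on_zero_byte_read_request()
conn.on_zero_byte_read_request()   # a repeat does not report the event again
print(conn.ulp_events)
```

The application never sees the 0B read; it only sees the event arrive a
little later than it would otherwise.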

> Steve's last comment states the problem well: we are trying to enable the
> Verbs layer on the Passive side to generate the Established event, and
> if at all possible to do so in a way that places no requirements on the
> application layer.
> 
> I believe it is possible to do so without making any modifications to MPA.
> 

Yes.

> The MPA protocol requirement is a safeguard against receiving an MPA
> Frame before the MPA Response frame. MPA does not have or need an
> RTR message, because the MPA RFC allows *any* MPA frame from the
> active side to effectively acknowledge receipt of the MPA Response.
> 

Yes, but it puts the onus on the ULP to deal with this.  In our current 
implementation model, that ULP is the top end application.

> That includes a zero-length RDMA Write.
> 
> An iWARP implementation can (perhaps SHOULD) implement an "MPA
> Fenced" state on the passive side that is cleared on receipt of any MPA
> frame. With such a "MPA Fence" feature, the CM layer can generate the
> "Connection Established" event as soon as it sends the MPA Response
> and the Passive-side ULP will be able to post to the SQ, the messages
> just won't go on the wire until something is received.
> 
> Meanwhile the Active Side must ensure that *some* MPA frame is sent
> immediately after the MPA Response is received. If it has traffic ready to
> go it can simply send that. If it does not, it can use a zero-length write.
> A zero-length write is totally transparent to the ULP at both ends.
> 
> But that will only work for *some* implementations. On others a zero
> length RDMA Read is needed to unjam things. That's almost transparent,
> but not totally so since it temporarily uses an RDMA Read credit.
>

Right.  Chelsio needs a Read vs a Write because the FW and driver don't 
detect the incoming 0B write so they cannot drive the ESTABLISHED event 
on that.
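
For comparison, the "MPA Fenced" idea quoted above could be modeled roughly
like this.  It is a sketch under stated assumptions: the queue class and its
fields are hypothetical, and it assumes the passive side can detect any
incoming MPA frame, which is exactly what the Chelsio FW and driver cannot do
for a 0B write.

```python
from collections import deque


class FencedSendQueue:
    """Hypothetical passive-side SQ with an 'MPA Fenced' state."""

    def __init__(self):
        self.fenced = True       # set when the MPA Response is sent
        self.pending = deque()   # work requests accepted but held back
        self.wire = []           # stand-in for what actually went out

    def post_send(self, wr):
        # Posting always succeeds; transmission waits for the fence to clear.
        if self.fenced:
            self.pending.append(wr)
        else:
            self.wire.append(wr)

    def on_mpa_frame_received(self):
        # Any MPA frame from the active side clears the fence and
        # releases the held work requests in posting order.
        self.fenced = False
        while self.pending:
            self.wire.append(self.pending.popleft())
```

On hardware that cannot see the incoming 0B write, the trigger would have to
be the 0B read instead.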

> And while nobody has spoken up to say *they* have that problem, I would
> not be surprised if there are implementations where nothing less than a full
> ULP "nop" message will suffice.
> 
> So keeping the fix at the verbs layer, and allowing the minimal extra
> effort to be controlled by the Passive layer itself, suggests that the
> Passive side simply encode its MPA-unjam-action-required in the
> OFA standardized portion of the Private Data. Encodings would
> include:
> 
> - Any MPA Frame, including a zero-length RDMA Write will unjam
>   the passive side SendQ.
> - An untagged message or a zero-length RDMA Read will work.
> - Only an untagged message will work.
> 

So you're advocating adding a standardized header to the private data to 
indicate what the passive side needs.  While we're at it, let's add in 
ORD/IRD ;-)
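
As a rough illustration of what such a header field might carry, the three
encodings listed above map naturally onto a small enumeration.  The names and
numeric values here are hypothetical; nothing like this is standardized, and
the choice of action simply follows the transparency ordering described in
the quoted text (0B write, then 0B read, then an untagged message).

```python
from enum import IntEnum


class UnjamRequirement(IntEnum):
    # Hypothetical values; no such encoding exists in any standard.
    ANY_MPA_FRAME = 0        # even a zero-length RDMA Write will do
    READ_OR_UNTAGGED = 1     # zero-length RDMA Read or untagged message
    UNTAGGED_ONLY = 2        # only an untagged message will do


def active_side_action(req):
    """Pick the least intrusive frame that satisfies the passive side."""
    if req == UnjamRequirement.ANY_MPA_FRAME:
        return "zero-length RDMA Write"
    if req == UnjamRequirement.READ_OR_UNTAGGED:
        return "zero-length RDMA Read"
    if req == UnjamRequirement.UNTAGGED_ONLY:
        return "untagged message"
    raise ValueError(f"unknown unjam requirement: {req!r}")


print(active_side_action(UnjamRequirement.READ_OR_UNTAGGED))
```

The active side would read the value out of the MPA Response private data and
send the chosen frame only if it has no real traffic ready to go.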

> In the latter cases the middleware will have to play games with stand-in
> receive WQEs and only posting the actual receive WQEs to the QP
> after the MPA fence has been unjammed. That isn't pretty, but if your
> hardware is fixed then it's either that or make the application deal with
> the problem. I have a hunch that the MPI developers would not like that
> option at all.
> 
> How this differs from what Arkady proposed is that it avoids making any
> changes to MPA, but instead only makes use of the OFA defined portion
> of the Private Data. Further it allows use of a zero-length RDMA Write
> when that is sufficient to break the MPA logjam. A zero-length RDMA
> Write, unlike a zero-length RDMA Read, is *totally* transparent to the ULP.

For the short term, I claim we just implement this as part of linux 
iwarp connection setup (mandating a 0B read be sent from the active 
side).  Your proposal to add meta-data to the private data requires a 
standards change anyway and is, IMO, the 2nd phase of this whole 
enchilada...

Steve.