[ofa-general] Re: iWARP peer-to-peer CM proposal

Tue Nov 27 15:41:33 PST 2007

On Nov 27, 2007 3:13 PM, Steve Wise <swise at opengridcomputing.com> wrote:
>
> Caitlin Bestler wrote:
> > On Nov 27, 2007 6:54 AM, Kanevsky, Arkady <Arkady.Kanevsky at netapp.com> wrote:
> >> ULP can post recvs before connection is established but not to send
> >> queue prior to connection establishment.
> >>
> >
> >
> > ULP can post sends only after it is notified that the connection is established.
> >
> > The issue is when the iWARP layer can issue this notification.
> >
> > If the MPA layer implements fencing on its own, then the notification can
> > be provided immediately after the MPA Request/Response exchange.
> >
> > If not, it must wait for the first MPA frame. The problem is that
> > implementations that adhere too closely to the RDMAC verbs can obtain
> > no information about the connection unless there is a CQE producing event.
>
> The idea for this "hack" is that the passive side (the side that sends
> the MPA response) will hold off posting the ESTABLISHED event to the
> rdma-cm ULP until after it receives this 0B Read Request from the client...
>

The problem is that this solution is being applied at the wrong layer.

The source of the problem is not MPA, but rather the RDMAC verbs layer.
The solution needs to be a verb-layer solution, not an MPA-layer solution.

Steve's last comment states the problem well: we are trying to enable the
Verbs layer on the Passive side to generate the Established event, and
if at all possible to do so in a way that places no requirements on the
application layer.

I believe it is possible to do so without making any modifications to MPA.

The MPA protocol requirement is a safeguard against receiving an MPA
Frame before the MPA Response frame. MPA does not have or need an
RTR message, because the MPA RFC allows *any* MPA frame from the
active side to effectively acknowledge receipt of the MPA Response.

That includes a zero-length RDMA Write.

An iWARP implementation can (perhaps SHOULD) implement an "MPA
Fenced" state on the passive side that is cleared on receipt of any MPA
frame. With such an "MPA Fence" feature, the CM layer can generate the
"Connection Established" event as soon as it sends the MPA Response,
and the Passive-side ULP will be able to post to the SQ; the messages
just won't go on the wire until something is received.
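
To make the fence idea concrete, here is a minimal sketch in Python of
a toy passive-side QP. It is only a model of the behaviour described
above, not any vendor's implementation; the class and method names
(PassiveSideQP, post_send, on_mpa_frame_received) are hypothetical.

```python
from collections import deque


class PassiveSideQP:
    """Toy model of a passive-side QP with a hypothetical MPA fence."""

    def __init__(self):
        # Assumption: the fence is armed when the MPA Response is sent,
        # which is also when "Connection Established" is reported.
        self.mpa_fenced = True
        self.pending = deque()  # SQ work accepted but held off the wire
        self.wire = []          # work that has actually been transmitted

    def post_send(self, wr):
        # The ULP may post to the SQ at any time after the event.
        if self.mpa_fenced:
            self.pending.append(wr)
        else:
            self.wire.append(wr)

    def on_mpa_frame_received(self):
        # Any MPA frame from the active side shows that it has seen the
        # MPA Response, so clear the fence and flush in posting order.
        self.mpa_fenced = False
        while self.pending:
            self.wire.append(self.pending.popleft())
```

Work posted while the fence is set stays in `pending`; the first
inbound MPA frame moves it to `wire` in the order it was posted.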

Meanwhile the Active Side must ensure that *some* MPA frame is sent
immediately after the MPA Response is received. If it has traffic ready to
go it can simply send that. If it does not, it can use a zero-length write.
A zero-length write is totally transparent to the ULP at both ends.

But that will only work for *some* implementations. On others a
zero-length RDMA Read is needed to unjam things. That's almost
transparent, but not totally so, since it temporarily uses an RDMA Read
credit.

And while nobody has spoken up to say *they* have that problem, I would
not be surprised if there are implementations where nothing less than a full
ULP "nop" message will suffice.

So keeping the fix at the verbs layer, and allowing the minimal extra
effort to be controlled by the Passive side itself, suggests that the
Passive side simply encode its MPA-unjam-action-required in the
OFA-standardized portion of the Private Data. Encodings would
include:

- Any MPA Frame, including a zero-length RDMA Write, will unjam
  the passive side SendQ.
- An untagged message or a zero-length RDMA Read will work.
- Only an untagged message will work.
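
As a rough sketch of how the active side might act on such an encoding,
the Python below maps each of the three cases to the cheapest frame
that satisfies it. The numeric values, the names (UnjamAction,
first_frame) and the string labels are all hypothetical; no actual
Private Data layout has been defined here.

```python
from enum import IntEnum


class UnjamAction(IntEnum):
    # Hypothetical values; the real encoding would be set by OFA.
    ANY_MPA_FRAME = 0
    UNTAGGED_OR_ZERO_LENGTH_READ = 1
    UNTAGGED_ONLY = 2


# Cheapest frame the active side can generate when the ULP has no
# suitable traffic ready to go.
_FALLBACK = {
    UnjamAction.ANY_MPA_FRAME: "zero_length_rdma_write",
    UnjamAction.UNTAGGED_OR_ZERO_LENGTH_READ: "zero_length_rdma_read",
    UnjamAction.UNTAGGED_ONLY: "ulp_nop_send",
}


def first_frame(required, ready=None):
    """Pick the first frame to send after the MPA Response arrives.

    `ready` is the kind of ULP traffic already queued, if any:
    "send" (untagged) or "rdma_write" (tagged).
    """
    required = UnjamAction(required)
    if ready == "send":
        # An untagged ULP message satisfies all three encodings.
        return "send"
    if ready is not None and required == UnjamAction.ANY_MPA_FRAME:
        # Any MPA frame will do, so just send the queued traffic.
        return ready
    return _FALLBACK[required]
```

For example, `first_frame(UnjamAction.ANY_MPA_FRAME)` selects the
zero-length RDMA Write, which is the case that stays invisible to the
ULP at both ends.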

In the latter cases the middleware will have to play games with stand-in
receive WQEs, posting the actual receive WQEs to the QP only after
the MPA fence has been unjammed. That isn't pretty, but if your
hardware is fixed then it's either that or make the application deal with
the problem. I have a hunch that the MPI developers would not like that
option at all.
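
One way the stand-in receive game could look for the untagged-message
case is sketched below in Python. This is an assumption about how
middleware might do it, not a description of any existing stack; the
names (Middleware, STANDIN, ulp_post_recv, on_recv_completion) are
hypothetical, and the receive queue is modelled as a plain list.

```python
STANDIN = object()  # hypothetical marker for the middleware's own WQE


class Middleware:
    """Toy model: absorb the unjam message so the ULP never sees it."""

    def __init__(self):
        # Assumption: a stand-in receive is posted before the MPA
        # Response goes out, so the unjam message lands in it.
        self.qp_rq = [STANDIN]  # receive WQEs actually posted to the QP
        self.held = []          # ULP receives withheld until unjammed
        self.unjammed = False

    def ulp_post_recv(self, buf):
        # Hold the ULP's receives back until the fence is cleared.
        if self.unjammed:
            self.qp_rq.append(buf)
        else:
            self.held.append(buf)

    def on_recv_completion(self):
        # Called once per inbound untagged message, in posting order.
        wqe = self.qp_rq.pop(0)
        if wqe is STANDIN:
            # The unjam message arrived: post the real receives now
            # and report nothing to the ULP.
            self.unjammed = True
            self.qp_rq.extend(self.held)
            self.held.clear()
            return None
        return wqe
```

The ULP's buffers reach the QP only after the stand-in has consumed
the unjam message, so the ULP's first completion is its own data.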

This differs from what Arkady proposed in that it avoids making any
changes to MPA, and instead only makes use of the OFA-defined portion
of the Private Data. Further, it allows use of a zero-length RDMA Write
when that is sufficient to break the MPA logjam. A zero-length RDMA
Write, unlike a zero-length RDMA Read, is *totally* transparent to the ULP.