[openib-general] RFC on CM error handling
Libor Michalek
libor at topspin.com
Fri Jan 21 13:53:21 PST 2005
On Fri, Jan 21, 2005 at 11:55:25AM -0800, Sean Hefty wrote:
> Libor Michalek wrote:
> >>An example issue that I'm thinking of is a user gets a reply callback.
> >> A reject is then received by the CM, and a second callback to the
> >>user is initiated. If the user tries to send an RTU, the call will
> >>fail since the cm_id is in an invalid state. If the user then returns
> >>-1 from the callback, the CM will destroy the cm_id. The destruction
> >>will block while the reject callback completes. Since the user
> >>returned -1 from the reply callback, they may not be ready to handle
> >>another callback.
> >>
> >>The fix that I'm working on should still allow multithreaded operation
> >>inside the CM, but callbacks to the user will be serialized. If a user
> >>returns a non-zero value from a callback, no additional callbacks will
> >>be generated.
> >
> >
> > OK, that's the behaviour I would expect. However, in the example, even
> > if the user returns 0 from the REP callback, I wouldn't expect to see
> > the REJ after the REP has been processed. (or after the RTU has been sent)
> > The CM state updates for a connection and resulting callbacks would be
> > serialized, so the REJ after the REP would be discarded since it was
> > received in a CM state which does not allow rejects. Or is this incorrect?
>
> From the REP callback, even if the call to send an RTU is successful,
> a REJ could still be received. (The remote side timed out waiting for
> the RTU.) Locally, the cm_id state went from REP_RCVD to ESTABLISHED
> to TIMEWAIT. Given this, it seems that there are missing state
> transitions in the spec handling a REJ from REP_RCVD or MRA_REP_SENT
> states, which would drive the state back to IDLE.
I think this state transition is ignored because data transfer will
detect the situation. After the RTU is sent, the connection moves to
ESTABLISHED and the QP is transitioned to RTS. A posted send will then
result in an error completion, since the remote QP has been destroyed
and will either not ACK the data or will NAK it. Applications that need
to know that a connection which is not transferring data is still
healthy should perform zero-byte RDMA writes...
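
To make the transitions discussed in this thread concrete, here is a rough
sketch in C. It is not the CM code; the enum, struct, and function names are
made up for illustration. It assumes a REJ is accepted in REP_RCVD and
MRA_REP_SENT (the transitions Sean suggests are missing from the spec) and
discarded once the connection is ESTABLISHED, where data transfer is left to
detect the failure.

```c
#include <stdbool.h>

/* Hypothetical names; only the states and messages come from the thread. */
enum cm_state {
    CM_IDLE,
    CM_REQ_SENT,
    CM_REP_RCVD,
    CM_MRA_REP_SENT,
    CM_ESTABLISHED
};

enum cm_event {
    CM_EV_REP,       /* REP received from the remote side */
    CM_EV_SEND_MRA,  /* local user sends an MRA for the REP */
    CM_EV_SEND_RTU,  /* local user sends an RTU */
    CM_EV_REJ        /* REJ received from the remote side */
};

struct cm_conn {
    enum cm_state state;
};

/*
 * Apply one event to the connection. Returns true if the event is valid
 * in the current state (and the state was updated), false if the event
 * is discarded and the state is left unchanged.
 */
static bool cm_handle_event(struct cm_conn *c, enum cm_event ev)
{
    switch (ev) {
    case CM_EV_REP:
        if (c->state != CM_REQ_SENT)
            return false;
        c->state = CM_REP_RCVD;
        return true;
    case CM_EV_SEND_MRA:
        if (c->state != CM_REP_RCVD)
            return false;
        c->state = CM_MRA_REP_SENT;
        return true;
    case CM_EV_SEND_RTU:
        if (c->state != CM_REP_RCVD && c->state != CM_MRA_REP_SENT)
            return false;
        c->state = CM_ESTABLISHED;
        return true;
    case CM_EV_REJ:
        /* Assumed transitions: REJ drives these states back to IDLE. */
        if (c->state == CM_REP_RCVD || c->state == CM_MRA_REP_SENT) {
            c->state = CM_IDLE;
            return true;
        }
        /* In ESTABLISHED (RTU already sent) the REJ is dropped. */
        return false;
    }
    return false;
}
```

With this model, a REJ that arrives after the RTU has been sent leaves the
state at ESTABLISHED, and the first posted send is what reports the error.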
-Libor