[openib-general] RFC on CM error handling
Sean Hefty
mshefty at ichips.intel.com
Fri Jan 21 14:29:35 PST 2005
Libor Michalek wrote:
>> From the REP callback, even if the call to send an RTU is successful,
>>a REJ could still be received. (The remote side timed out waiting for
>>the RTU.) Locally, the cm_id state went from REP_RCVD to ESTABLISHED
>>to TIMEWAIT. Given this, it seems that there are missing state
>>transitions in the spec handling a REJ from REP_RCVD or MRA_REP_SENT
>>states, which would drive the state back to IDLE.
>
> I think this state transition is ignored, since data transfer will
> detect the situation. After the RTU is sent and the connection is
> transferred to ESTABLISHED, the QP is transitioned to RTS, a posted
> send will result in an error completion, since the remote QP has been
> destroyed and will either not ack or nack the data. Applications that
> care about detecting that a connection, which is not transferring data,
> is healthy should perform zero byte RDMA writes...
I agree that the CM could ignore this transition. The CM can probably
ignore all REJ messages and rely on timeouts (which is why I haven't
coded that portion yet...). Long term I think the CM should attempt to
handle REJ in all valid states. From the CM's perspective, the
required effort appears to be adding another case in a switch statement.
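A rough sketch of what that extra case might look like. The enum, the
function name, and the choice to ignore REJ in every other state are
assumptions made for illustration only; this is not the actual CM code.

```c
#include <stdbool.h>

/* Hypothetical state names, modeled on the states named in this thread. */
enum cm_state {
    CM_IDLE,
    CM_REP_RCVD,
    CM_MRA_REP_SENT,
    CM_ESTABLISHED,
    CM_TIMEWAIT
};

/*
 * Apply a received REJ to the current state.
 * Returns true if the REJ caused a transition, false if it was ignored.
 */
bool cm_handle_rej(enum cm_state *state)
{
    switch (*state) {
    case CM_REP_RCVD:      /* transition missing from the spec */
    case CM_MRA_REP_SENT:  /* transition missing from the spec */
        *state = CM_IDLE;
        return true;
    default:
        /* Assumption: ignore the REJ elsewhere and rely on timeouts. */
        return false;
    }
}
```

Only the two transitions discussed above are handled here; a real handler
would also need to cover the REJ cases the spec already defines.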
Along these same lines, there are a few more missing state transitions
from the spec. A client can receive a DREQ from the REP_SENT state,
receive a REP from DREQ_SENT, and receive a DREP from DREQ_RCVD. The
CM will handle these by going from REP_SENT to DREQ_RCVD, resending the
DREQ, or transitioning to TIMEWAIT, respectively.
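To make the three cases concrete, here is a minimal sketch of them as a
lookup function. All of the type, enum, and function names are hypothetical,
and the "resend DREQ" action is only reported to the caller rather than
performed.

```c
/* Hypothetical names; only the states and messages named above appear. */
enum conn_state { ST_REP_SENT, ST_DREQ_SENT, ST_DREQ_RCVD, ST_TIMEWAIT };
enum conn_msg   { MSG_DREQ, MSG_REP, MSG_DREP };
enum conn_act   { ACT_NONE, ACT_RESEND_DREQ };

struct cm_result {
    enum conn_state next;   /* state after processing the message */
    enum conn_act   action; /* side effect the caller should carry out */
};

/* Handle the three transitions that are missing from the spec. */
struct cm_result cm_extra_transition(enum conn_state state, enum conn_msg msg)
{
    /* Default: stay in the current state and do nothing. */
    struct cm_result r = { state, ACT_NONE };

    switch (state) {
    case ST_REP_SENT:
        if (msg == MSG_DREQ)
            r.next = ST_DREQ_RCVD;
        break;
    case ST_DREQ_SENT:
        if (msg == MSG_REP)
            r.action = ACT_RESEND_DREQ; /* state stays DREQ_SENT */
        break;
    case ST_DREQ_RCVD:
        if (msg == MSG_DREP)
            r.next = ST_TIMEWAIT;
        break;
    default:
        break;
    }
    return r;
}
```

Any other state and message pairing falls through unchanged, which is an
assumption of this sketch and not something the spec or the CM dictates.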
- Sean