[openib-general] RFC on CM error handling

Fri Jan 21 14:29:35 PST 2005

Libor Michalek wrote:
>> From the REP callback, even if the call to send an RTU is successful, 
>>a REJ could still be received.  (The remote side timed out waiting for 
>>the RTU.)  Locally, the cm_id state went from REP_RCVD to ESTABLISHED 
>>to TIMEWAIT.  Given this, it seems that there are missing state 
>>transitions in the spec handling a REJ from REP_RCVD or MRA_REP_SENT 
>>states, which would drive the state back to IDLE.
> 
>   I think this state transition is ignored, since data transfer will
> detect the situation. After the RTU is sent and the connection is
> transferred to ESTABLISHED, the QP is transitioned to RTS, a posted
> send will result in an error completion, since the remote QP has been
> destroyed and will either not ack or nack the data. Applications that
> care about detecting that a connection, which is not transferring data,
> is healthy should perform zero byte RDMA writes...

I agree that the CM could ignore this transition.  The CM can probably 
ignore all REJ messages and rely on timeouts (which is why I haven't 
coded that portion yet...).  Long term I think the CM should attempt to 
handle REJ in all valid states.  From the CM's perspective, the 
required effort appears to be adding another case in a switch statement.
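
To illustrate the "another case in a switch statement" point, here is a
minimal sketch of REJ handling for the two states discussed above. The
enum, its values, and cm_handle_rej() are hypothetical names made up for
this example; they are not the actual CM code, and the real handler would
also have to cancel timers and notify the user.

```c
#include <stdbool.h>

/* Hypothetical state names, modeled on the states named in this thread. */
enum cm_state {
	CM_IDLE,
	CM_REP_RCVD,
	CM_MRA_REP_SENT,
	CM_ESTABLISHED,
	CM_TIMEWAIT
};

/*
 * Hypothetical REJ handler: returns true if the REJ was accepted in the
 * current state (and the state was driven back to IDLE), false if the
 * message is ignored and the state is left untouched.
 */
static bool cm_handle_rej(enum cm_state *state)
{
	switch (*state) {
	case CM_REP_RCVD:
	case CM_MRA_REP_SENT:
		/* The transitions described above as missing from the spec. */
		*state = CM_IDLE;
		return true;
	default:
		/* Ignored here; timeouts or data transfer detect the failure. */
		return false;
	}
}
```

Supporting REJ in a further state would then only mean adding its case
label to the switch.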

Along these same lines, there are a few more missing state transitions 
from the spec.  A client can receive a DREQ from the REP_SENT state, 
receive a REP from DREQ_SENT, and receive a DREP from DREQ_RCVD.  The 
CM will handle these by going from REP_SENT to DREQ_RCVD, resending the 
DREQ, or transitioning to TIMEWAIT, respectively.
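
The three cases above could be sketched roughly as follows. All of the
type, value, and function names are assumptions made up for illustration,
not the real CM interfaces, and anything not listed above is simply
dropped in this sketch.

```c
/* Hypothetical states, messages, and actions for this example only. */
enum conn_state { ST_REP_SENT, ST_DREQ_SENT, ST_DREQ_RCVD, ST_TIMEWAIT };
enum cm_msg     { MSG_REP, MSG_DREQ, MSG_DREP };
enum cm_action  { ACT_NONE, ACT_RESEND_DREQ, ACT_DROP };

/*
 * Applies one received message to the connection state and returns the
 * action the caller should take.
 */
static enum cm_action cm_handle_msg(enum conn_state *state, enum cm_msg msg)
{
	switch (*state) {
	case ST_REP_SENT:
		if (msg == MSG_DREQ) {
			/* DREQ while waiting for the RTU: go to DREQ_RCVD. */
			*state = ST_DREQ_RCVD;
			return ACT_NONE;
		}
		break;
	case ST_DREQ_SENT:
		if (msg == MSG_REP)
			/* REP after our DREQ: stay put, resend the DREQ. */
			return ACT_RESEND_DREQ;
		break;
	case ST_DREQ_RCVD:
		if (msg == MSG_DREP) {
			/* DREP while in DREQ_RCVD: go to TIMEWAIT. */
			*state = ST_TIMEWAIT;
			return ACT_NONE;
		}
		break;
	default:
		break;
	}
	/* Any other state/message pair is dropped in this sketch. */
	return ACT_DROP;
}
```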

- Sean