[openib-general] RDMA_CM_EVENT_UNREACHABLE(-ETIMEDOUT)

Tue Aug 1 23:44:05 PDT 2006

Eric Barton wrote:
> I've had a report of rdma_connect() failing with a callback event type of
> RDMA_CM_EVENT_UNREACHABLE and status -ETIMEDOUT although the peer node was
> up and running at the time.
> 
> It seems this can be reproduced as follows...
> 
> 1. Establish a connection between nodes A and B
> 
> 2. Reboot node A
> 
> 3. Start establishing a new connection from node A to node B
> 
> 4. After a timeout, the CM callback occurs as described.
> 
> Could this happen with a buggy SM?  Are there some good places in the
> OpenFabrics stack to add printks to help point the finger (or can some
> existing debug/trace code be enabled)?

Eric,

My guess is that this is related to the CM, not the SM.

I think there is a chance that the CM on node B does not treat the REQ 
sent by A after the reboot as a "stale connection" situation and hence 
just **silently** drops it, that is, no REJ is sent.

Adding prints in the if/else below, within core/cm.c :: cm_match_req(), 
would help you figure out whether the direction I suggest is indeed the 
one for you to hunt.

I am not familiar enough with how the CM IDs are generated, but my basic 
thinking is that generating them randomly should solve it. If the IDs 
start from some seed value and each newly generated ID is the current 
value plus one, then setting the initial seed to jiffies instead of to 
some constant should be fine. Otherwise node A may hand out, after the 
reboot, the same IDs it used before, and B may take the new REQ for a 
duplicate of the old one.
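
To illustrate the seeding argument, here is a minimal user-space sketch, 
not the actual cm.c code. The struct id_alloc type and the id_alloc_*() 
and id_reused_after_reboot() helpers are hypothetical names, the 32-bit 
ID width is an assumption, and the seed is passed in as a plain 
parameter standing in for either a constant or a jiffies-like value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sequential allocator: hands out seed, seed + 1, ... */
struct id_alloc {
    uint32_t next;
};

/* Start (or restart after a simulated reboot) from the given seed. */
static void id_alloc_boot(struct id_alloc *a, uint32_t seed)
{
    a->next = seed;
}

/* Return the next ID; unsigned arithmetic wraps around at 2^32. */
static uint32_t id_alloc_get(struct id_alloc *a)
{
    return a->next++;
}

/*
 * Boot with seed_before and allocate one ID for a connection, then
 * "reboot" with seed_after and allocate the first ID again.  Returns
 * true if the new ID equals the one the peer may still remember.
 */
static bool id_reused_after_reboot(uint32_t seed_before, uint32_t seed_after)
{
    struct id_alloc a;
    uint32_t old_id, new_id;

    id_alloc_boot(&a, seed_before);
    old_id = id_alloc_get(&a);

    id_alloc_boot(&a, seed_after);  /* reboot: allocator state is lost */
    new_id = id_alloc_get(&a);

    return old_id == new_id;
}
```

With a constant seed, id_reused_after_reboot(0, 0) reports a reuse. With 
two different seeds it does not, which only helps if the time-derived 
value really differs from one boot to the next.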

Or.

if (timewait_info) {
    cur_cm_id_priv = cm_get_id(timewait_info->work.local_id,
                               timewait_info->work.remote_id);
    spin_unlock_irqrestore(&cm.lock, flags);
    if (cur_cm_id_priv) {
        /* a matching cm_id still exists: REQ is handled as a duplicate */
        cm_dup_req_handler(work, cur_cm_id_priv);
        cm_deref_id(cur_cm_id_priv);
    } else
        /* no matching cm_id: reject the REQ as a stale connection */
        cm_issue_rej(work->port, work->mad_recv_wc,
                     IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REQ,
                     NULL, 0);
    /* reached from both branches */
    goto error;
}