[openib-general] RDMA_CM_EVENT_UNREACHABLE(-ETIMEDOUT)
Or Gerlitz
ogerlitz at voltaire.com
Tue Aug 1 23:44:05 PDT 2006
Eric Barton wrote:
> I've had a report of rdma_connect() failing with a callback event type of
> RDMA_CM_EVENT_UNREACHABLE and status -ETIMEDOUT although the peer node was
> up and running at the time.
>
> It seems this can be reproduced as follows...
>
> 1. Establish a connection between nodes A and B
>
> 2. Reboot node A
>
> 3. Start establishing a new connection from node A to node B
>
> 4. After a timeout, the CM callback occurs as described.
>
> Could this happen with a buggy SM? Are there some good places in the
> OpenFabrics stack to add printks to help point the finger (or can some
> existing debug/trace code be enabled)?
Eric,
My guess is that this is related to the CM, not the SM.
I think there is a chance that the CM on node B does not treat the REQ
sent by A after the reboot as a "stale connection" situation and hence
just **silently** drops it, that is, no REJ is sent.
Adding prints in the if/else below, within core/cm.c :: cm_match_req(),
would help you figure out whether the direction I suggest is indeed the
one for you to hunt.
I am not familiar enough with the generation of the CM IDs, but my basic
thinking is that generating them randomly should solve it. If the IDs
start from some seed value and each newly generated ID is the current
value plus one, then setting the initial seed to jiffies instead of to
some constant should be fine.
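
To illustrate the seed idea, here is a minimal user-space sketch. It is
not the CM code: struct id_alloc, id_alloc_init(), id_alloc_get() and
first_id_collides() are hypothetical names made up for this example. It
only assumes what is described above, namely that each new ID is the
previous value plus one, and checks whether the first ID handed out after
a "reboot" is one the peer may still remember from before the reboot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ID allocator for illustration only (not core/cm.c). */
struct id_alloc {
	uint32_t next;
};

/* Start the counter at the given seed. */
static void id_alloc_init(struct id_alloc *a, uint32_t seed)
{
	assert(a != NULL);
	a->next = seed;
}

/* Hand out the current value, then advance by one. */
static uint32_t id_alloc_get(struct id_alloc *a)
{
	assert(a != NULL);
	return a->next++;
}

/*
 * Return 1 if the first ID allocated after a reboot (counter seeded with
 * seed_after) equals one of the used_before IDs allocated before the
 * reboot (counter seeded with seed_before), 0 otherwise.
 */
static int first_id_collides(uint32_t seed_before, uint32_t used_before,
			     uint32_t seed_after)
{
	struct id_alloc before, after;
	uint32_t first_after, i;

	id_alloc_init(&before, seed_before);
	id_alloc_init(&after, seed_after);
	first_after = id_alloc_get(&after);

	for (i = 0; i < used_before; i++)
		if (id_alloc_get(&before) == first_after)
			return 1;
	return 0;
}

/* Constant seed: the reboot replays the same IDs. Varying seed: it need not. */
static void self_check(void)
{
	assert(first_id_collides(1, 4, 1) == 1);
	assert(first_id_collides(1000, 4, 250000) == 0);
}
```

With a constant seed the first post-reboot ID always repeats a pre-reboot
one. A time-based seed such as jiffies only makes a repeat unlikely, since
two boots can still produce nearby values; a random seed avoids that
dependence on boot timing.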
Or.
	if (timewait_info) {
		cur_cm_id_priv = cm_get_id(timewait_info->work.local_id,
					   timewait_info->work.remote_id);
		spin_unlock_irqrestore(&cm.lock, flags);
		if (cur_cm_id_priv) {
			cm_dup_req_handler(work, cur_cm_id_priv);
			cm_deref_id(cur_cm_id_priv);
		} else
			cm_issue_rej(work->port, work->mad_recv_wc,
				     IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REQ,
				     NULL, 0);
		goto error;
	}
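
As a rough guide to where the prints would go, the branch structure of
the snippet above can be modelled in user space as follows. This is a
simplified sketch, not kernel code: enum req_outcome and classify_req()
are hypothetical, the two flags stand in for the timewait_info and
cur_cm_id_priv checks, and fprintf() stands in for the printk() calls one
would add in the kernel.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical outcomes of the if/else in the quoted snippet. */
enum req_outcome {
	REQ_NEW,	/* no earlier entry matched: not handled by this branch */
	REQ_DUPLICATE,	/* matched entry still has an ID: duplicate-REQ path */
	REQ_REJ_STALE	/* matched entry has no ID: REJ with "stale connection" */
};

/*
 * have_timewait models "timewait_info != NULL";
 * have_live_id models "cur_cm_id_priv != NULL".
 */
static enum req_outcome classify_req(int have_timewait, int have_live_id)
{
	if (have_timewait) {
		if (have_live_id) {
			/* The quoted snippet issues no REJ on this path. */
			fprintf(stderr, "REQ: duplicate path taken\n");
			return REQ_DUPLICATE;
		}
		fprintf(stderr, "REQ: issuing REJ, stale connection\n");
		return REQ_REJ_STALE;
	}
	return REQ_NEW;
}

/* Exercise all three outcomes once. */
static void self_check(void)
{
	assert(classify_req(1, 1) == REQ_DUPLICATE);
	assert(classify_req(1, 0) == REQ_REJ_STALE);
	assert(classify_req(0, 0) == REQ_NEW);
}
```

If the print on the duplicate path shows up on node B when A reconnects
after its reboot, that would support the guess that the REQ is not being
treated as a stale connection.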