[ofa-general] IB/cm: bug in stale connection detection logic?

Sean Hefty mshefty at ichips.intel.com
Mon May 21 09:34:45 PDT 2007


> 1. I see this in cm_match_req:
> 
>         timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info);
>         if (!timewait_info)
>                 timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info);
> 
>         if (timewait_info) {
>                 cur_cm_id_priv = cm_get_id(timewait_info->work.local_id,
>                                            timewait_info->work.remote_id);
>                 cm_cleanup_timewait(cm_id_priv->timewait_info);
>                 spin_unlock_irqrestore(&cm.lock, flags);
>                 if (cur_cm_id_priv) {
>                         cm_dup_req_handler(work, cur_cm_id_priv);
>                         cm_deref_id(cur_cm_id_priv);
>                 } else
>                         cm_issue_rej(work->port, work->mad_recv_wc,
>                                      IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REQ,
>                                      NULL, 0);
> 
> Note that cm_get_id is passed data from timewait_info and not from the request:
> thus, if the QPN in request matches QPN in an existing connection, this is
> mis-detected as a duplicate request even if the IDs do not match;
> thus, the request is dropped or "duplicate" reject is sent instead of
> a "stale connection" reject.
> 
> Am I missing something?

I think you may be right on the QPN check, so I'll look into it more. 
Note that a REQ doesn't carry the local ID, which is why cm_get_id 
doesn't use the IDs from the REQ.

> Suggestion:
> Why is an extra call to cm_get_id required to detect a duplicate?
> Shouldn't we just
> 
>         timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info);
> 	if (timewait_info) {
> 		/* handle duplicate */

This isn't necessarily a duplicate.  We need to check the state of the 
local connection endpoint (hence the extra call to cm_get_id).  Also if 
the remote CM lost its state information, it could re-use the remote ID 
for a new connection.

> 		return;
> 	}
> 
> 	timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info);
> 	if (timewait_info) {
> 		/* handle stale */

If the previous check fails, it does look like this should be stale.

> 		return;
> 	}
> 
> 	not a duplicate and not a stale connection
> 
> 2. Another question:
> 
> 	cm_dup_req_handler does this:
>        /* Quick state check to discard duplicate REQs. */
>         if (cm_id_priv->id.state == IB_CM_REQ_RCVD)
>                 return;
> 
> Why is this code here? IB_CM_REQ_RCVD is an ephemeural state,
> going to IB_CM_REP_SENT immediately.

The transition from REQ_RCVD to REP_SENT requires user intervention.  It 
is not immediate and can take several seconds to up to a minute 
depending on how quickly the user responds to connection requests.  For 
userspace apps, the timing depends on how quickly the user retrieves and 
processes CM events, which can take longer than the retry timeout.

> Why are duplicate REQs discarded? Should not REP be re-sent?
> See 12.9.6 COMMUNICATION ESTABLISHMENT - PASSIVE

A REP can only be re-sent if we're in the REP_SENT state, which is not 
the state being checked.  If the remote side has sent 3 REQs in the time 
that it takes to respond to the first REQ, it's inefficient to generate 
2 duplicate REPs when finally sending the first REP.

- Sean



More information about the general mailing list