[ofa-general] IB/cm: bug in stale connection detection logic?
Michael S. Tsirkin
mst at dev.mellanox.co.il
Sun May 20 06:44:41 PDT 2007
Sean, Roland, pls comment on the following 2 questions:
1. I see this in cm_match_req:
timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info);
if (!timewait_info)
timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info);
if (timewait_info) {
cur_cm_id_priv = cm_get_id(timewait_info->work.local_id,
timewait_info->work.remote_id);
cm_cleanup_timewait(cm_id_priv->timewait_info);
spin_unlock_irqrestore(&cm.lock, flags);
if (cur_cm_id_priv) {
cm_dup_req_handler(work, cur_cm_id_priv);
cm_deref_id(cur_cm_id_priv);
} else
cm_issue_rej(work->port, work->mad_recv_wc,
IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REQ,
NULL, 0);
Note that cm_get_id is passed data from timewait_info and not from the request:
thus, if the QPN in request matches QPN in an existing connection, this is
mis-detected as a duplicate request even if the IDs do not match;
thus, the request is dropped or "duplicate" reject is sent instead of
a "stale connection" reject.
Am I missing something?
Suggestion:
Why is an extra call to cm_get_id required to detect a duplicate?
Shouldn't we just
timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info);
if (timewait_info) {
/* handle duplicate */
return;
}
timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info);
if (timewait_info) {
/* handle stale */
return;
}
not a duplicate and not a stale connection
2. Another question:
cm_dup_req_handler does this:
/* Quick state check to discard duplicate REQs. */
if (cm_id_priv->id.state == IB_CM_REQ_RCVD)
return;
Why is this code here? IB_CM_REQ_RCVD is an ephemeural state,
going to IB_CM_REP_SENT immediately.
Why are duplicate REQs discarded? Should not REP be re-sent?
See 12.9.6 COMMUNICATION ESTABLISHMENT - PASSIVE
The spec also says:
The general rule for handling any input message that is received while in
an ephemeral state is to hold that message pending until the CM protocol
enters a non-ephemeral state.
So this code looks wrong.
What am I missing?
--
MST
More information about the general
mailing list