[openib-general] Question about QP's in timewait state and CM stale conn rejects

Sean Hefty mshefty at ichips.intel.com
Thu Aug 17 10:18:20 PDT 2006


Or Gerlitz wrote:
> If you don't mind (also related to the patch you have sent Eric of 
> randomizing the initial local cm id) to get into this deeper, can we do 

There's an issue with randomizing the initial local CM ID.  Because of the way 
the IDR works, if you start at a high value, the IDR grows to cover that first 
value, which can result in memory allocation failures.  In my tests, using a 
random starting value frequently resulted in connection failures because of 
low memory.

My conclusion is that the local ID assignment in the IB CM needs to be reworked. 
Otherwise we will run into a condition where, after some number of connections 
have been established, we are unable to create any new connections, even if the 
previous connections have all been destroyed.
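
For illustration only, here is a minimal sketch of one possible way to keep the 
table index small while the ID seen on the wire still looks random.  This is an 
assumption on my part, not something proposed in this thread: the struct and 
both helpers (id_map, wire_id_from_index, index_from_wire_id) are hypothetical 
names, and the random operand is assumed to be chosen once at startup.

```c
#include <stdint.h>

/* Hypothetical mapping state: one random value picked once at startup. */
struct id_map {
    uint32_t random_operand;
};

/* Turn a small, dense table index into the ID placed in outgoing messages. */
static uint32_t wire_id_from_index(const struct id_map *m, uint32_t index)
{
    return index ^ m->random_operand;
}

/* Recover the dense table index from an ID received from the peer. */
static uint32_t index_from_wire_id(const struct id_map *m, uint32_t wire_id)
{
    return wire_id ^ m->random_operand;
}
```

Because XOR with a fixed operand is its own inverse, the table can keep 
allocating from low indices while distinct indices still map to distinct wire 
IDs.  This only addresses the high starting value; it does not by itself deal 
with running out of IDs after many connections.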

>> static struct cm_id_private * cm_match_req(struct cm_work *work,
>> +                                          struct cm_id_private *cm_id_priv)
>> +{
>> +       struct cm_id_private *listen_cm_id_priv, *cur_cm_id_priv;
>> +       struct cm_timewait_info *timewait_info;
>> +       struct cm_req_msg *req_msg;
>> +       unsigned long flags;
>> +
>> +       req_msg = (struct cm_req_msg *)work->mad_recv_wc->recv_buf.mad;
>> +
>> +       /* Check for duplicate REQ and stale connections. */
>> +       spin_lock_irqsave(&cm.lock, flags);
>> +       timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info);
>> +       if (!timewait_info)
>> +               timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info);
> 
> 
> This if() holds when <remote_id, remote_ca_guid> entry is present in 
> remote_id_table OR <remote_qpn,remote_ca_guid> entry is present in 
> remote_qpn_table

Correct.

> 
>> +       if (timewait_info) {
>> +               cur_cm_id_priv = cm_get_id(timewait_info->work.local_id,
>> +                                          timewait_info->work.remote_id);
> 
>> +               spin_unlock_irqrestore(&cm.lock, flags);
> 
>> +               if (cur_cm_id_priv) {
>> +                       cm_dup_req_handler(work, cur_cm_id_priv);
>> +                       cm_deref_id(cur_cm_id_priv);
> 
> 
> <local_id, remote_id> entry exists in local_id_table, looking on 
> dup_req_handler() i see it sends REP when the id is in "MRA sent" and 
> sends a STALE_CONN REJ when the id is in timewait state, else it does 
> nothing.

It sends an MRA (not a REP) if the ID is in the MRA sent state, or a STALE_CONN 
reject in the timewait state, as you describe.
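
As a rough sketch of that dispatch, assuming only the three cases discussed 
here (the enum and function names below are hypothetical and are not the 
identifiers used in the CM source):

```c
/* Hypothetical, simplified view of the states relevant to a duplicate REQ. */
enum cm_state_sketch { ST_MRA_REQ_SENT, ST_TIMEWAIT, ST_OTHER };

/* What the duplicate-REQ handler does in response. */
enum dup_action { DUP_SEND_MRA, DUP_SEND_REJ_STALE_CONN, DUP_IGNORE };

static enum dup_action dup_req_action(enum cm_state_sketch state)
{
    switch (state) {
    case ST_MRA_REQ_SENT:
        return DUP_SEND_MRA;            /* resend the MRA */
    case ST_TIMEWAIT:
        return DUP_SEND_REJ_STALE_CONN; /* reject: stale connection */
    default:
        return DUP_IGNORE;              /* any other state: do nothing */
    }
}
```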

>> +               } else
>> +                       cm_issue_rej(work->port, work->mad_recv_wc,
>> +                                    IB_CM_REJ_STALE_CONN, CM_MSG_RESPONSE_REQ,
>> +                                    NULL, 0);
> 
> 
> what is this case? there is no <local_id,remote_id> entry but there is 
> remote <id,ca_guid> or <qpn,ca_guid> entries???

If we get here, the REQ is a new REQ rather than a duplicate, but its remote_id 
or remote_qpn is already in use.  We need to reject the new REQ as containing 
stale data.
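
Put together, the checks in the quoted code amount to the decision below.  This 
is a simplified sketch with hypothetical names (req_outcome, classify_req); the 
two booleans stand in for the table lookups.  The "continue" outcome is an 
assumption, since the quoted excerpt ends before that path.

```c
#include <stdbool.h>

enum req_outcome {
    REQ_CONTINUE_AS_NEW,     /* no conflicting entry: keep processing the REQ */
    REQ_HANDLE_AS_DUPLICATE, /* matching <local_id, remote_id>: duplicate REQ */
    REQ_REJECT_STALE_CONN    /* remote ID/QPN in use by another connection    */
};

/*
 * remote_in_use:    the <remote_id, ca_guid> or <remote_qpn, ca_guid> insert
 *                   found an existing entry.
 * local_pair_found: the <local_id, remote_id> lookup on that entry succeeded.
 */
static enum req_outcome classify_req(bool remote_in_use, bool local_pair_found)
{
    if (!remote_in_use)
        return REQ_CONTINUE_AS_NEW;
    if (local_pair_found)
        return REQ_HANDLE_AS_DUPLICATE;
    return REQ_REJECT_STALE_CONN;
}
```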

- Sean



