[ofa-general] Re: [RFC][PATCH] last recv race patch
Roland Dreier
rdreier at cisco.com
Fri Jun 27 21:39:28 PDT 2008
> The race is between the context being taken off the reap_list for
> destroy because of the timeout and being re-added to the reap_list
> from this condition. So we need a condition check here.
I'm sorry, I still don't see what the bug is. What timeout removes
things from the rx_reap_list? And how could any context get added to
the rx_reap_list more than once? The code you are changing is:
> if (!--p->recv_count) {
> spin_lock_irqsave(&priv->lock, flags);
> - list_move(&p->list, &priv->cm.rx_reap_list);
> - spin_unlock_irqrestore(&priv->lock, flags);
> - queue_work(ipoib_workqueue, &priv->cm.rx_reap_task);
> + if (p->state == IPOIB_CM_RX_LIVE) {
> + list_move(&p->list, &priv->cm.rx_reap_list);
> + spin_unlock_irqrestore(&priv->lock, flags);
> + queue_work(ipoib_workqueue, &priv->cm.rx_reap_task);
> + } else
> + spin_unlock_irqrestore(&priv->lock, flags);
and I don't see how recv_count could ever be 0 twice: we never increment it.
In any case I don't see how testing against the LIVE state could be
correct, since for example we want to reap stale QPs that we moved to
IPOIB_CM_RX_ERROR after all the pending receives have completed.
- R.