[ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V2] patch for review

Michael S. Tsirkin mst at dev.mellanox.co.il
Thu Apr 19 04:51:19 PDT 2007


> Quoting Pradeep Satyanarayana <pradeep at us.ibm.com>:
> Subject: IPOIB CM (NOSRQ)[PATCH V2] patch for review
> 
> Here is a second version of the IPOIB_CM_NOSRQ patch for review. This 
> patch will benefit adapters that do not support shared receive queues.
> 
> This patch incorporates the previous review comments:
> - #ifdefs removed, and a single binary drives HCAs that do and do not
> support SRQs
> - avoids linear traversal through a list of QPs
> - extraneous code removed
> - compile-time selection removed
> - No HTML version as part of this patch

The patch is still severely line-wrapped, to the point of unreadability.
Look at it here:
http://article.gmane.org/gmane.linux.drivers.openib/38681

> This patch has been tested with linux-2.6.21-rc5 and rc7 with Topspin and
> IBM HCAs on ppc64 machines. I have run netperf between two IBM HCAs and
> two Topspin HCAs, as well as between an IBM and a Topspin HCA.
> 
> Note 1: There was an interesting discovery that I made when I ran netperf
> between a Topspin and an IBM HCA. I started to see the IB_WC_RETRY_EXC_ERR
> error upon send completion. This may have been due to the difference in
> processing speeds of the two HCAs. This was rectified by setting the
> retry_count to a non-zero value in ipoib_cm_send_req(). I had to do this
> in spite of the comment --> /* RFC draft warns against retries */

This would only help if there are short bursts of high-speed activity
on the receiving HCA: if the speeds differ in the long run, the right
thing to do is to drop some packets and have TCP adjust its window
accordingly.

But in the former case (short bursts), just increasing the number of
pre-posted buffers on the RQ should be enough, and that looks like a much
cleaner solution.
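
Just to spell out what I mean, in plain verbs terms: a rough, untested
sketch, with the ring depth, the sge table and the wr_id scheme made up
rather than taken from your patch:

#include <rdma/ib_verbs.h>

/* Untested sketch: pre-post a deeper per-QP receive ring so that short
 * bursts are absorbed without RNR NAKs.  Assumes the QP was created with
 * at least ring_size receive WRs; ring_size, sge_tbl and the wr_id
 * encoding here are placeholders, not the patch's actual code. */
static int prepost_rx_ring(struct ib_qp *qp, struct ib_sge *sge_tbl,
			   int ring_size)
{
	struct ib_recv_wr wr, *bad_wr;
	int i, ret;

	for (i = 0; i < ring_size; ++i) {
		wr.next    = NULL;
		wr.wr_id   = i;
		wr.sg_list = &sge_tbl[i];
		wr.num_sge = 1;

		ret = ib_post_recv(qp, &wr, &bad_wr);
		if (ret)
			return ret;
	}
	return 0;
}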

Long-term, I think we should use the watermark event to dynamically adjust
the number of RQ buffers to the incoming traffic.
I'll try to work on such a patch, probably in the 2.6.23 timeframe.
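
To be clear about what I mean by the watermark: in verbs terms that is the
SRQ limit and its IB_EVENT_SRQ_LIMIT_REACHED async event, so a non-SRQ RQ
would need its own low-water accounting. A rough, untested sketch of the
SRQ side; post_srq_batch() and LOW_WM are placeholders:

#include <rdma/ib_verbs.h>

/* Re-arm the SRQ limit ("watermark") so we get an async event when the
 * number of posted receives drops below low_wm. */
static int rearm_srq_watermark(struct ib_srq *srq, u32 low_wm)
{
	struct ib_srq_attr attr = {
		.srq_limit = low_wm,
	};

	return ib_modify_srq(srq, &attr, IB_SRQ_LIMIT);
}

/* SRQ async event handler (hooked up via ib_srq_init_attr.event_handler):
 * on the limit event, post another batch of receives and re-arm.
 * post_srq_batch() and LOW_WM are placeholders, not real code. */
static void srq_event_handler(struct ib_event *event, void *ctx)
{
	if (event->event == IB_EVENT_SRQ_LIMIT_REACHED) {
		post_srq_batch(event->element.srq, ctx);
		rearm_srq_watermark(event->element.srq, LOW_WM);
	}
}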

> Can someone point me to where this comment is in the RFC? I would like to 
> understand the reasoning.

See "7.1 A Cautionary Note on IPoIB-RC".
See also classics such as http://sites.inka.de/~W1011/devel/tcp-tcp.html

By the way, as long as you are not using SRQ, why not use UC mode QPs?
That would be a cleaner solution.
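
Since UC generates no ACK/NAK traffic, there is no RNR or retry-exceeded
condition to handle in the first place: a drop is just a drop, as with UD,
and TCP recovers it. In QP-creation terms it is basically a one-line
change. Untested sketch, with placeholder PD, CQ and queue depths rather
than your patch's actual values:

#include <rdma/ib_verbs.h>

/* Untested sketch: the only substantive change vs. an RC receive QP is
 * qp_type.  The PD/CQ arguments and the queue depths are placeholders. */
static struct ib_qp *create_uc_rx_qp(struct ib_pd *pd, struct ib_cq *cq)
{
	struct ib_qp_init_attr init_attr = {
		.send_cq          = cq,
		.recv_cq          = cq,
		.cap.max_send_wr  = 1,
		.cap.max_send_sge = 1,
		.cap.max_recv_wr  = 256,	/* placeholder RQ depth */
		.cap.max_recv_sge = 1,
		.sq_sig_type      = IB_SIGNAL_ALL_WR,
		.qp_type          = IB_QPT_UC,	/* instead of IB_QPT_RC */
	};

	return ib_create_qp(pd, &init_attr);
}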

You can also try making the RNR condition cheaper to handle, by moving
the QP to RESET and back through RTR to RTS instead of initiating a new
connection.
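
Something like the below, reusing the existing cm_id and letting
ib_cm_init_qp_attr() fill in the attributes for each transition (note you
have to pass through INIT between RESET and RTR). Untested sketch; it
glosses over how the two sides resynchronize PSNs:

#include <linux/kernel.h>
#include <rdma/ib_verbs.h>
#include <rdma/ib_cm.h>

/* Untested sketch: recover a QP that hit an RNR/retry-exceeded error by
 * cycling it RESET -> INIT -> RTR -> RTS on the existing connection,
 * instead of tearing it down and issuing a new CM REQ. */
static int recycle_qp(struct ib_cm_id *cm_id, struct ib_qp *qp)
{
	static const enum ib_qp_state states[] = {
		IB_QPS_INIT, IB_QPS_RTR, IB_QPS_RTS
	};
	struct ib_qp_attr attr;
	int i, mask, ret;

	/* Back to RESET: only IB_QP_STATE is needed for this transition. */
	attr.qp_state = IB_QPS_RESET;
	ret = ib_modify_qp(qp, &attr, IB_QP_STATE);
	if (ret)
		return ret;

	/* Then INIT -> RTR -> RTS, with the CM filling in the attributes
	 * (pkey, path, PSNs, timeouts) from the still-connected cm_id. */
	for (i = 0; i < ARRAY_SIZE(states); ++i) {
		attr.qp_state = states[i];
		ret = ib_cm_init_qp_attr(cm_id, &attr, &mask);
		if (!ret)
			ret = ib_modify_qp(qp, &attr, mask);
		if (ret)
			return ret;
	}
	return 0;
}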

Unfortunately, I won't have time to review the patch thoroughly in the
coming couple of weeks. One general comment, however:

> @@ -360,7 +489,16 @@ void ipoib_cm_handle_rx_wc(struct net_de
>                  return;
>          }
>  
> -        skb  = priv->cm.srq_ring[wr_id].skb;
> +        if(priv->cm.srq)
> +                skb  = priv->cm.srq_ring[wr_id].skb;
> +        else {
> +                index = (wc->wr_id & ~IPOIB_CM_OP_NOSRQ) & NOSRQ_INDEX_MASK;
> +                spin_lock_irqsave(&priv->lock, flags);
> +                rx_ptr = priv->cm.rx_index_ring[index];
> +                spin_unlock_irqrestore(&priv->lock, flags);
> +
> +                skb     = rx_ptr->rx_ring[wr_id].skb;
> +        } /* NOSRQ */
>  
>          if (unlikely(wc->status != IB_WC_SUCCESS)) {
>                  ipoib_dbg(priv, "cm recv error "

In this and other examples, you scatter "if (priv->cm.srq)" tests all over
the code. I think it would be much cleaner in most cases to move the
non-SRQ code into separate functions.

If there is code common to the SRQ and non-SRQ paths, it can be factored
out and reused in both places. In cases such as the above this also has
speed advantages, both in cache footprint and in branch prediction.
You can even register a separate completion event handler for the SRQ and
non-SRQ cases, avoiding mode tests on the data path completely.
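
Roughly along these lines; the names are invented and the bodies elided,
this is only meant to show the shape:

#include <rdma/ib_verbs.h>

/* Untested sketch, names invented: give the SRQ and non-SRQ receive paths
 * their own completion handlers so the hot path never tests priv->cm.srq. */
static void ipoib_cm_rx_comp_srq(struct ib_cq *cq, void *dev_ptr)
{
	/* poll the CQ; look each skb up in priv->cm.srq_ring[wr_id] */
}

static void ipoib_cm_rx_comp_nosrq(struct ib_cq *cq, void *dev_ptr)
{
	/* poll the CQ; look each skb up via rx_index_ring[index]->rx_ring[] */
}

/* Picked once at init time, when the CQ is created with ib_create_cq(): */
static ib_comp_handler choose_rx_comp_handler(int have_srq)
{
	return have_srq ? ipoib_cm_rx_comp_srq : ipoib_cm_rx_comp_nosrq;
}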

-- 
MST


