[ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V2] patch for review

Pradeep Satyanarayana pradeep at us.ibm.com
Mon Apr 23 12:58:24 PDT 2007


"Michael S. Tsirkin" <mst at dev.mellanox.co.il> wrote on 04/19/2007 04:51:19 
AM:

> > Quoting Pradeep Satyanarayana <pradeep at us.ibm.com>:
> > Subject: IPOIB CM (NOSRQ)[PATCH V2] patch for review
> > 
> > Here is a second version of the IPOIB_CM_NOSRQ patch for review. This 
> > patch will benefit adapters that do not support shared receive queues.
> > 
> > This patch incorporates the previous review comments:
> > -#ifdefs removed and a single binary drives HCAs that do and do not 
> > support SRQs
> > -avoids linear traversal through a list of QPs
> > -extraneous code removed
> > -compile time selection removed 
> > -No HTML version as part of this patch
> 
> The patch is still severely line-wrapped, to the point of unreadability.
> Look at it here:
> http://article.gmane.org/gmane.linux.drivers.openib/38681

I expect to be able to address this the next time.

> 
> > This patch has been tested with linux-2.6.21-rc5 and rc7 with Topspin
> > and IBM HCAs on ppc64 machines. I have run netperf between two IBM
> > HCAs and two Topspin HCAs, as well as between an IBM and a Topspin HCA.
> > 
> > Note 1: There was an interesting discovery that I made when I ran
> > netperf between the Topspin and IBM HCAs. I started to see the
> > IB_WC_RETRY_EXC_ERR error upon send completion. This may have been due
> > to the differences in the processing speeds of the two HCAs. This was
> > rectified by setting the retry_count to a non-zero value in
> > ipoib_cm_send_req(). I had to do this in spite of the comment
> > --> /* RFC draft warns against retries */
> 
> This would only help if there are short bursts of high-speed activity
> on the receiving HCA: if the speed is different in the long run,
> the right thing to do is to drop some packets and have TCP adjust
> its window accordingly.
> 
> But in that former case (short bursts), just increasing the number of
> pre-posted buffers on RQ should be enough, and looks like a much
> cleaner solution.

This was not an issue with running out of buffers (which was my original
suspicion too). It was probably due to missed ACKs - I am guessing this
happens because the two HCAs have very different processing speeds.

This is exacerbated by the fact that the retry count (not the RNR retry
count) was 0. When I changed the retry count to a small value like 3, it
worked. Please see below for additional details.
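
For clarity, the experiment boiled down to a one-line change in the CM REQ
parameters filled in by ipoib_cm_send_req(). Roughly (sketched from memory,
so treat the surrounding lines as illustrative; 3 is simply the value I
happened to test with):

        struct ib_cm_req_param req = {};
        ...
        /* Was 0, per the "RFC draft warns against retries" comment.
         * A small non-zero value avoids IB_WC_RETRY_EXC_ERR when the
         * two HCAs have very different processing speeds.
         */
        req.retry_count     = 3;
        req.rnr_retry_count = 0;        /* RNR retries left untouched */
        ...
        return ib_send_cm_req(id, &req);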

> 
> Long-term, I think we should use the watermark event to dynamically
> adjust the number of RQ buffers with the incoming traffic.
> I'll try to work on such a patch probably for 2.6.23 timeframe.
> 
> > Can someone point me to where this comment is in the RFC? I would like
> > to understand the reasoning.
> 
> See "7.1 A Cautionary Note on IPoIB-RC".
> See also classics such as http://sites.inka.de/~W1011/devel/tcp-tcp.html


If we do this right, the above-mentioned problems should not occur. In the
case we are dealing with, the RC timers are expected to be much smaller
than the TCP timers and should not interfere with them. The IBM HCA uses a
default value of 0 for the Local CA Ack Delay, which is probably too small;
with a retry count of 0 as well, ACKs are missed. I agree with Roland's
assessment (made in a separate thread) that this should not be 0.

On the other hand, with the Topspin adapter (and mthca) that I have, the
Local CA Ack Delay is 0xf, which implies a Local ACK Timeout of
4.096us * 2^15, i.e. about 134 ms. The IB spec says the actual timeout can
be up to 4 times this value, which means roughly 540 ms.

The smallest TCP retransmission timeout in Linux (TCP_RTO_MIN) is HZ/5,
i.e. 200 ms. So yes, even with a retry count of 1 or 2 there is a risk of
interfering with the TCP timers.

If my understanding is correct, the right way to do this is to use a small
value for the Local CA Ack Delay, say 3 or 4, which would imply a timeout
of roughly 33-66 us, together with a small retry count of 2 or 3. That way
the maximum timeout would still be only several hundred microseconds, a
factor of about 1000 below the minimum TCP timeout. IB adapters are
supposed to have much smaller latency than Ethernet adapters, so I am
guessing that this would be in the ballpark for most HCAs.
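
To make the arithmetic concrete, here is a throwaway user-space program
(not part of the patch) that evaluates the IB timeout formula,
Local ACK Timeout = 4.096 us * 2^value, for the values discussed above:

#include <stdio.h>

int main(void)
{
        /* 3 and 4 are the small values proposed above; 0xf is what the
         * Topspin/mthca adapter reports for Local CA Ack Delay.
         */
        int values[] = { 3, 4, 15 };
        int i;

        for (i = 0; i < 3; i++) {
                double timeout_us = 4.096 * (double)(1u << values[i]);
                printf("value %2d -> %9.1f us nominal, up to %8.1f ms "
                       "with the spec's 4x allowance\n",
                       values[i], timeout_us, 4.0 * timeout_us / 1000.0);
        }
        return 0;
}

For value 0xf this prints about 134 ms nominal and roughly 540 ms worst
case, which is where the numbers above come from.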

Unfortunately I do not know how much effort it would take to change the
Local CA Ack Delay across the various HCAs (if need be). In the interim,
the only parameter we can control is the retry count, and we could make
that a module parameter.
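
If we go the module-parameter route, a minimal sketch would look like the
following (the parameter name and default are only placeholders, not
something that is in the posted patch):

static int retry_count = 3;
module_param(retry_count, int, 0444);
MODULE_PARM_DESC(retry_count, "IPoIB CM retry count placed in the CM REQ (0-7)");

        /* ... and in ipoib_cm_send_req(): */
        req.retry_count = retry_count;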

> 
> By the way, as long as you are not using SRQ, why not use UC mode QPs?
> This would look like a cleaner solution.
> 
> You can also try making the RNR condition cheaper to handle, by moving
> the QP to RST and back to RTR and then to RTS instead of re-initiating
> a new connection.
> 
> Unfortunately, I haven't the time to review the patch thoroughly in the
> coming couple of weeks. A general comment however:
> 
> > @@ -360,7 +489,16 @@ void ipoib_cm_handle_rx_wc(struct net_de
> >                 return;
> >         }
> >
> > -       skb  = priv->cm.srq_ring[wr_id].skb;
> > +       if (priv->cm.srq)
> > +               skb = priv->cm.srq_ring[wr_id].skb;
> > +       else {
> > +               index = (wc->wr_id & ~IPOIB_CM_OP_NOSRQ) & NOSRQ_INDEX_MASK;
> > +               spin_lock_irqsave(&priv->lock, flags);
> > +               rx_ptr = priv->cm.rx_index_ring[index];
> > +               spin_unlock_irqrestore(&priv->lock, flags);
> > +
> > +               skb = rx_ptr->rx_ring[wr_id].skb;
> > +       } /* NOSRQ */
> >
> >         if (unlikely(wc->status != IB_WC_SUCCESS)) {
> >                 ipoib_dbg(priv, "cm recv error "
> 
> In this, and other examples, you scatter "if priv->cm.srq" tests all
> over the code. I think it would be much cleaner in most cases to separate
> the non-SRQ code to separate functions.
> 
> If there's common SRQ/non-SRQ code it can be factored out and reused in
> both places. In cases such as the above this also has speed advantages:
> from both cache footprint as well as branch prediction POV.
> You can even have a separate event handler for SRQ/non-SRQ code,
> avoiding mode tests on data path completely.

I wanted to avoid a lot of duplicate code and the maintainability headaches
that come with it. I will factor out some of the common code and repost the
patch.
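
As a rough outline of what I have in mind for the receive path (function
names are illustrative, and the details will follow the posted patch rather
than this sketch): a thin per-mode lookup on top of a shared helper, so the
data path never tests priv->cm.srq:

/* Shared tail: status check, frag handling, netif_receive_skb(), repost */
static void ipoib_cm_rx_common(struct net_device *dev, struct ib_wc *wc,
                               struct sk_buff *skb);

static void ipoib_cm_handle_rx_wc_srq(struct net_device *dev, struct ib_wc *wc)
{
        struct ipoib_dev_priv *priv = netdev_priv(dev);
        u64 wr_id = wc->wr_id & ~IPOIB_CM_OP_SRQ;

        ipoib_cm_rx_common(dev, wc, priv->cm.srq_ring[wr_id].skb);
}

static void ipoib_cm_handle_rx_wc_nosrq(struct net_device *dev, struct ib_wc *wc)
{
        struct ipoib_dev_priv *priv = netdev_priv(dev);
        u32 index = (wc->wr_id & ~IPOIB_CM_OP_NOSRQ) & NOSRQ_INDEX_MASK;
        struct ipoib_cm_rx *rx_ptr;
        unsigned long flags;
        u64 wr_id;

        spin_lock_irqsave(&priv->lock, flags);
        rx_ptr = priv->cm.rx_index_ring[index];
        spin_unlock_irqrestore(&priv->lock, flags);

        wr_id = ...;    /* per-ring index, recovered as in the posted patch */

        ipoib_cm_rx_common(dev, wc, rx_ptr->rx_ring[wr_id].skb);
}

ipoib_cm_handle_rx_wc() then either becomes a two-line dispatcher, or the
two handlers get hooked up as separate completion handlers as you suggest,
so the SRQ test disappears from the hot path entirely.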

> 
> -- 
> MST

Pradeep
pradeep at us.ibm.com


