[ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V2] patch for review

Mon Apr 23 13:50:32 PDT 2007

> > > This patch has been tested with linux-2.6.21-rc5 and rc7 with Topspin and 
> > > IBM HCAs on ppc64 machines. I have run
> > > netperf between two IBM HCAs and two Topspin HCAs, as well as between IBM 
> > > and Topspin HCA.
> > > 
> > > Note 1: There was an interesting discovery that I made when I ran netperf 
> > > between Topspin and IBM HCA. I started to see
> > > the IB_WC_RETRY_EXC_ERR error upon send completion. This may have been due 
> > > to the differences in the
> > > processing speeds of the two HCAs. This was rectified by setting the 
> > > retry_count to a non-zero value in ipoib_cm_send_req().
> > > I had to do this in spite of the comment --> /* RFC draft warns against 
> > > retries */ 
> > 
> > This would only help if there are short bursts of high-speed activity
> > on the receiving HCA: if the speed is different in the long run,
> > the right thing to do is to drop some packets and have TCP adjust
> > its window accordingly.
> > 
> > But in that former case (short bursts), just increasing the number 
> > of pre-posted
> > buffers on RQ should be enough, and looks like a much cleaner solution.
> 
> This was not an issue with running out of buffers (which was my original 
> suspicion too). This was probably due to missing ACKs - I am guessing 
> this happens because the two HCAs have very different processing speeds.

I don't see how different processing speeds could trigger missing ACKs.
Do you?

> This is exacerbated by the fact that retry count (not RNR retry count) was 0.
> When I changed the retry count to a small value like 3 it still works.
> Please see below for additional details.

Looks like a work-around for some breakage elsewhere.
Maybe it's a good thing we don't retry in such cases - retries are not good
for network performance, and this way we move the problem to its
root cause where it can be debugged and fixed instead of overloading the network.

> > > Can someone point me to where this comment is in the RFC? I would like to 
> > > understand the reasoning.
> > 
> > See "7.1 A Cautionary Note on IPoIB-RC".
> > See also classics such as http://sites.inka.de/~W1011/devel/tcp-tcp.html
> 
> 
> If we do this right, the above mentioned problems should not occur. In the case
> we are dealing with, the RC timers are expected to be much smaller (than TCP
> timers) and should not interfere with TCP timers. The IBM HCA uses a default
> value of 0 for the Local CA Ack Delay, which is probably too small a value, and
> with a retry count of 0, ACKs are missed. I agree with Roland's assessment (this
> was in a separate thread), that this should not be 0.

So, it's an ehca bug then?
I didn't really get the explanation. Who loses the ACKs? ehca?
Is it the case that ehca *reports* a Local CA Ack Delay that is
*below* what it actually provides? If so, it should be easy to fix in the driver.

> On the other hand with the Topspin adapter (and mthca) that I have the 
> Local CA Ack Delay is 0xf which would imply a Local Ack Timeout of 4.096us * 2^15 which 
> is about 134ms. The IB spec says it can be up to 4 times this value which means up to 
> 537 ms.
> 
> The smallest TCP retransmission timer is HZ/5 which is 200 ms on several 
> architectures.
> Yes, even with a retry count of 1 or 2, there is then a risk of 
> interfering with TCP timers.
> 
> If my understanding is correct, the way it should be done is to have a small
> value for the Local CA Ack Delay like say 3 or 4 which would imply a Timeout
> value of 32-64us, with a small retry count of 2 or 3. This way the max Timeout
> would still be only several hundreds of us, a factor of 1000 less than the
> minimum TCP timeout. IB adapters are supposed to have a much smaller latency
> than ethernet adapters, so I am guessing that this would be in the ballpark for
> most HCAs.
> 
> Unfortunately I do not know how much of an effort it will take to change the
> Local CA Ack Delay across the various HCAs (if need be).
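
For reference, the timeout arithmetic quoted above can be sketched in a few
lines of C. This is only an illustration under the thread's own simplifying
assumptions: the timeout is taken as 4.096us * 2^field, the "up to 4 times"
slack is applied as a flat factor, and a retry count of N is taken to mean
N + 1 transmission attempts. The function and macro names are made up for
this example and are not from ipoib or ib_verbs.

```c
#include <stdint.h>

/* Hypothetical names, for illustration only (not kernel or ib_verbs API). */
#define ACK_TIMEOUT_BASE_NS 4096ULL /* 4.096 us expressed in nanoseconds */
#define ACK_TIMEOUT_SLACK   4ULL    /* "up to 4 times" factor quoted in the thread */

/* Nominal timeout for a 5-bit encoded field: 4.096 us * 2^field. */
static uint64_t ack_timeout_ns(unsigned int field)
{
    return ACK_TIMEOUT_BASE_NS << (field & 0x1f);
}

/*
 * Assumed worst case before the sender gives up: every attempt
 * (the first send plus retry_count retries) waits the full 4x timeout.
 */
static uint64_t worst_case_wait_ns(unsigned int field, unsigned int retry_count)
{
    return ACK_TIMEOUT_SLACK * ack_timeout_ns(field) * (retry_count + 1ULL);
}
```

With these assumptions, ack_timeout_ns(15) comes to 134217728 ns (about 134 ms)
and four times that is already above the 200 ms TCP minimum, while
worst_case_wait_ns(3, 2) is 393216 ns, i.e. a few hundred microseconds.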

How about fixing ehca not to trigger ACK loss instead?

> In the interim, the
> only parameter we can control is the retry count and we could make this a module
> parameter.

Since both 0 and > 0 values might lead to problems, this does not
look like a real solution.

> > 
> > By the way, as long as you are not using SRQ, why not use UC mode QPs?
> > This would look like a cleaner solution.

You haven't addressed this, and this might be a better way out.  Unreliable SRQ
being only supported for RC QPs now is really one of the major reasons IPoIB CM
uses RC rather than UC.

-- 
MST