[ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V2] patch for review

Pradeep Satyanarayana pradeep at us.ibm.com
Mon Apr 23 16:53:31 PDT 2007


"Michael S. Tsirkin" <mst at dev.mellanox.co.il> wrote on 04/23/2007 01:50:32 PM:


> > > 
> > > This would only help if there are short bursts of high-speed activity
> > > on the receiving HCA: if the speed is different in the long run,
> > > the right thing to do is to drop some packets and have TCP adjust
> > > its window accordingly.
> > > 
> > > But in that former case (short bursts), just increasing the number
> > > of pre-posted buffers on the RQ should be enough, and looks like a
> > > much cleaner solution.
> > 
> > This was not an issue with running out of buffers (which was my
> > original suspicion too). This was probably due to missing ACKs - I am
> > guessing this happens because the two HCAs have very different
> > processing speeds.
> 
> I don't see how different processing speeds could trigger missing ACKs.
> Do you?

Note: In the netperf tests, errors were seen only when one side is ehca
and the other side is mthca. When both sides are ehca, or both are mthca,
no errors are seen.

In the netperf tests I observed that ehca encountered lots of send
completion errors. ehca encountered send completion errors whether it was
the sender or the receiver (presumably when sending Acks as the receiver).
On the contrary, mthca reported no errors - even when I changed
/sys/module/ib_mthca/parameters/debug_level to 1
(that is the way to turn on debug on mthca - right?).

With the Local CA Ack Delay set to 0 on ehca, I believe it is probably
taking mthca more than 16 us to deliver the Ack back to ehca. It might not
be exactly 16 us, but I just assumed 4 times (as per the spec) the Local
Ack Timeout of about 4 us. That triggers the send completion error on
ehca.

On the other hand, when two ehca adapters use RC, no errors are
encountered, implying that the Ack is consistently delivered within 16 us.

Since mthca sets the Local CA Ack Delay value to 15, the timeouts between
two mthcas are much larger (> 128 ms) and hence no problems are
encountered. It is for that reason that I stated that different processing
speeds may be triggering the missing Acks.
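To make the arithmetic concrete: the 16 us and 128 ms figures above both
come from the same spec formula (Local Ack Timeout = 4.096 us * 2^value,
with up to 4x that allowed). A small sketch in Python - the helper names
here are mine for illustration, not from any driver:

```python
# Sketch of the IB Local Ack Timeout arithmetic used above.
# The 4.096 us * 2^value formula is from the IB spec; the helper
# names are made up for illustration only.

def local_ack_timeout_us(ack_delay):
    """Nominal Local Ack Timeout (us) for a Local CA Ack Delay value."""
    return 4.096 * (2 ** ack_delay)

def worst_case_us(ack_delay):
    """Up to 4x the nominal timeout, as allowed by the spec."""
    return 4 * local_ack_timeout_us(ack_delay)

# ehca default (value 0): ~4 us nominal, ~16 us worst case.
print(local_ack_timeout_us(0), worst_case_us(0))   # 4.096 16.384

# mthca/Topspin (value 0xf = 15): ~134 ms nominal, ~537 ms worst case,
# in line with the rough 128 ms / 512 ms figures in this thread (which
# round the 4.096 us base down to 4 us).
print(local_ack_timeout_us(15) / 1000, worst_case_us(15) / 1000)
```

For what it's worth, values of 3 or 4 come out at roughly 33-66 us
nominal, consistent with the 32-64 us range discussed in this thread.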

> 
> > This is exacerbated by the fact that the retry count (not the RNR
> > retry count) was 0. When I changed the retry count to a small value
> > like 3, it still works. Please see below for additional details.
> 
> Looks like a work-around for some breakage elsewhere.
> Maybe it's a good thing we don't retry in such cases - retries are not
> good for network performance, and this way we move the problem to its
> root cause, where it can be debugged and fixed, instead of overloading
> the network.

There is no single value all HCAs can pick that provides optimal
performance in all situations. The only way would be to select a value
that is optimal for each HCA, and depend on a retry mechanism when the
selected value does not meet the needs of interoperability. Depending on
higher levels like TCP, or even the application, to do the retries will
kill performance.

> 
> > > > Can someone point me to where this comment is in the RFC? I
> > > > would like to understand the reasoning.
> > > 
> > > See "7.1 A Cautionary Note on IPoIB-RC".
> > > See also classics such as
> > > http://sites.inka.de/~W1011/devel/tcp-tcp.html
> > 
> > 
> > If we do this right, the above mentioned problems should not occur. In
> > the case we are dealing with, the RC timers are expected to be much
> > smaller than the TCP timers and should not interfere with them. The
> > IBM HCA uses a default value of 0 for the Local CA Ack Delay, which is
> > probably too small a value, and with a retry count of 0, ACKs are
> > missed. I agree with Roland's assessment (this was in a separate
> > thread) that this should not be 0.
> 
> So, it's an ehca bug then?
> I didn't really get the explanation. Who loses the ACKs? ehca?
> Is it the case that ehca *reports* a Local CA Ack Delay that is
> *below* what it actually provides? If so, it should be easy to fix in
> the driver.

Yes, there is a problem with the IBM HCA, and we will address this. I
stated as much when I concurred with Roland's assessment.

> 
> > On the other hand, with the Topspin adapter (and mthca) that I have,
> > the Local CA Ack Delay is 0xf, which would imply a Local Ack Timeout
> > of 4.096 us * 2^15, which is about 128 ms. The IB spec says it can be
> > up to 4 times this value, which means up to 512 ms.
> > 
> > The smallest TCP retransmission timer is HZ/5, which is 200 ms on
> > several architectures. Yes, even with a retry count of 1 or 2, there
> > is then a risk of interfering with TCP timers.
> > 
> > If my understanding is correct, the way it should be done is to have
> > a small value for the Local CA Ack Delay, say 3 or 4, which would
> > imply a Timeout value of 32-64 us, with a small retry count of 2 or 3.
> > This way the max Timeout would still be only several hundreds of us,
> > a factor of 1000 less than the minimum TCP timeout. IB adapters are
> > supposed to have a much smaller latency than ethernet adapters, so I
> > am guessing that this would be in the ballpark for most HCAs.
> > 
> > Unfortunately I do not know how much of an effort it will take to
> > change the Local CA Ack Delay across the various HCAs (if need be).
> 
> How about fixing ehca not to trigger ACK loss instead?

As previously stated, the IBM HCA will address these issues. However, my
understanding is that the mthca/Topspin adapters also have a problem (too
high a value for the Local CA Ack Delay). Both HCAs need to be fixed for
good interoperability.


> 
> > In the interim, the only parameter we can control is the retry count,
> > and we could make this a module parameter.
> 
> Since both 0 and > 0 values might lead to problems, this does not
> look like a real solution.
> 

Please see the previous reasoning as to why we need a retry mechanism.

> > > 
> > > By the way, as long as you are not using SRQ, why not use UC mode
> > > QPs? This would look like a cleaner solution.
> 
> You haven't addressed this, and this might be a better way out.
> Unreliable SRQ being only supported for RC QPs now is really one of
> the major reasons IPoIB CM uses RC rather than UC.
> 

This is a good point you make. However, it will not address the core issue
behind the missing Acks - the difference in processing speeds. What
happens when the next version of the IBM HCA (or, for that matter, an HCA
from any other vendor) supporting SRQ comes out?

Pradeep
pradeep at us.ibm.com
