[ofa-general] IPOIB/CM increase retry counts

Pradeep Satyanarayana pradeeps at linux.vnet.ibm.com
Wed Feb 13 09:34:06 PST 2008


Or Gerlitz wrote:
> Pradeep Satyanarayana wrote:
>> I brought this issue up on the mailing list sometime in the summer of
>> 2007, as I recall. I could not locate it with a quick search of the
>> archives. I will probably look again later.
> 
> It's from December 2007:
> http://lists.openfabrics.org/pipermail/general/2007-December/044299.html
> 
>> However, the crux of the issue is that I was seeing "send completion
>> errors" and
>> that is what prompted me to change the retry counts. Please see Table
>> 78 "Completion Error Handling for RC Send Queues" in the IB Spec for
>> reference.
>> And changing the retry counts did help.
> 
> I understand that changing the retry counts eliminated the issue you
> were seeing in your setup. However, it's more of an observation than an
> actual problem statement whose solution can be judged. Apart from that,
> I have concerns regarding the approach of adding retries to a layer that
> provides unreliable service; see my comments on the other emails, and
> feel free to respond there.

Hello Or,

Thanks for the pointer to the December thread. I actually brought this
issue up well before then. Here is the link:

http://lists.openfabrics.org/pipermail/general/2007-April/035308.html

I was seeing "send completion errors", which means the QP was being torn
down and recreated all the time. It was on account of this that I changed the
retry counts, not the other way round.

In this case the TCP timers are so large (hundreds of ms) compared to the
microsecond timescales of InfiniBand that the QP is torn down (and recreated)
before TCP takes any action to recover from errors. As you can guess, the
performance tanks.
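
To put rough numbers on the timing argument, here is a minimal sketch, not a
measurement from this setup. It assumes the IB Local ACK Timeout encoding
(4.096 us * 2^value, with 0 meaning "wait forever") and a 3-bit retry count,
and it ignores any slack the HCA adds on top. The 200 ms TCP retransmission
timeout and the sample field value 14 are assumptions chosen only for
illustration, and the function names are hypothetical.

```python
# Rough estimate of how long an RC QP waits before reporting a send
# completion error, compared with an assumed TCP retransmission timeout.

IB_ACK_TIMEOUT_BASE_US = 4.096  # base unit of the Local ACK Timeout encoding


def local_ack_timeout_us(timeout_field: int) -> float:
    """Per-attempt ACK timeout in microseconds for a 5-bit timeout field."""
    if not 0 <= timeout_field <= 31:
        raise ValueError("timeout field is 5 bits (0..31)")
    if timeout_field == 0:
        return float("inf")  # 0 means the timer is disabled
    return IB_ACK_TIMEOUT_BASE_US * (2 ** timeout_field)


def time_to_send_error_us(timeout_field: int, retry_cnt: int) -> float:
    """Lower bound on time until the retry count is exhausted.

    Counts the initial attempt plus retry_cnt retries, each timing out.
    """
    if not 0 <= retry_cnt <= 7:
        raise ValueError("retry count is 3 bits (0..7)")
    return (retry_cnt + 1) * local_ack_timeout_us(timeout_field)


def qp_fails_before_tcp(timeout_field: int, retry_cnt: int,
                        tcp_rto_ms: float = 200.0) -> bool:
    """True if the QP would error out before the assumed TCP RTO fires."""
    return time_to_send_error_us(timeout_field, retry_cnt) < tcp_rto_ms * 1000.0


if __name__ == "__main__":
    for retries in (0, 3, 7):
        t_ms = time_to_send_error_us(14, retries) / 1000.0
        print(f"retry_cnt={retries}: ~{t_ms:.1f} ms, "
              f"fails before TCP: {qp_fails_before_tcp(14, retries)}")
```

Under these assumptions, a low retry count lets the QP go into error well
before TCP would even retransmit, while a higher retry count stretches the
window past the assumed TCP timer.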

I am not clear why you think that this was an observation rather than an actual
problem.

Pradeep




