[ofa-general] IPOIB/CM increase retry counts

Pradeep Satyanarayana pradeeps at linux.vnet.ibm.com
Wed Feb 13 11:36:58 PST 2008


Or Gerlitz wrote:
> On 2/13/08, Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com> wrote:
>> Or Gerlitz wrote:
> 
>>> I understand that changing the retry counts eliminated the issue you
>>> were seeing in your setup; however, it's more of an observation than an
>>> actual problem statement whose solution can be judged.
> 
>> I am not clear why you think that this was an observation rather than an actual problem.
> 
> I did not mean to say that there is no actual problem; I just don't
> see any actual evidence here that proves or suggests that --the--
> problem is indeed the different speeds of the HCAs. The fact that adding
> retries eliminated the send errors is not enough. For example, maybe
> adding just RNR retries would do well? Maybe just adding retries
> would? Maybe you were seeing it in April 2007, before NAPI was
> implemented? Etc., etc. I have sent a note on that to the ewg list
> asking if people can reproduce the problem. It would be best if you
> could name two HCA types + FW version + node setting + test that can
> reproduce the problem.

Unfortunately, I do not have the same setup that I had previously, so I am
unable to provide you with all the details at this point. However, I do remember
that it was ehca and mthca on ppc64 machines.

If memory serves me right, just adding the retries solved the issue. However,
I changed rnr_retries too since, as pointed out in Table 78 of the IB spec, that
could be a possibility as well. I wanted to cover that case (rnr_retries) in case
someone else ran into it.
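
To make the distinction between the two counters concrete, here is a minimal
sketch in C. It is not the IPoIB/CM patch itself: the struct and helper names
are hypothetical, and only the field names are chosen to resemble the
retry_count and rnr_retry_count connection parameters. It assumes what the IB
spec describes: both counters are 3-bit values (0 to 7), and for the RNR
counter the value 7 means "retry indefinitely".

```c
#include <stdint.h>

/* Both counters are 3-bit fields, so 7 is the largest encodable value. */
#define RETRY_FIELD_MAX    7u
/* For the RNR counter only, 7 is interpreted as "retry indefinitely". */
#define RNR_RETRY_INFINITE 7u

/* Hypothetical container for illustration; not taken from the IPoIB/CM code. */
struct cm_retry_params {
    uint8_t retry_count;     /* transport retries (e.g. ACK timeouts) */
    uint8_t rnr_retry_count; /* retries after a Receiver Not Ready NAK */
};

/* Clamp a requested count into the 3-bit range. */
static uint8_t clamp_retry(unsigned int requested)
{
    return (uint8_t)(requested > RETRY_FIELD_MAX ? RETRY_FIELD_MAX : requested);
}

/* Build a parameter set from requested values, clamping each one. */
static struct cm_retry_params make_retry_params(unsigned int retries,
                                                unsigned int rnr_retries)
{
    struct cm_retry_params p;

    p.retry_count = clamp_retry(retries);
    p.rnr_retry_count = clamp_retry(rnr_retries);
    return p;
}

/* Returns 1 if RNR NAKs would be retried without limit, 0 otherwise. */
static int rnr_retries_forever(const struct cm_retry_params *p)
{
    return p->rnr_retry_count == RNR_RETRY_INFINITE;
}
```

Keeping the two counters separate in a test makes it possible to raise one at a
time, which is one way to tell whether plain retries or RNR retries are what
removes the send errors.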

> 
> Also, you did well without this patch in the code for 10 months now,
> so I don't see why it has to go into OFED in such a rush. The fact
> that Roland missed commenting on it twice should not stop you from
> sending it to him a third time... maintainers are busy, it
> happens.

The fact is that I have always been running with that change on my systems.
As you will see from the history of the patches, I did not want it to be a
sticking point, so I removed it from the mainline patch. The plan was to reopen
the conversation about getting it into mainline after OFED 1.3. You may have
seen that I have tried bringing up the issue several times in the past.

Pradeep
