[ofa-general] timeout question

Fri May 16 12:01:07 PDT 2008

2008/5/17 Dotan Barak <dotanba at gmail.com>:
> Rui Machado wrote:
>>
>> 2008/5/16 Roland Dreier <rdreier at cisco.com>:
>>
>>>
>>>  > Hmm... is there no workaround for this situation? I mean, if the
>>>  > server dies, isn't there any way for the sender/client to notice?
>>>  > If the timeout is too large, this can be cumbersome.
>>>  >
>>>  > I tried reducing the timeout, and indeed the client notices sooner
>>>  > when the server exits, but another problem arises: with the server
>>>  > still running, the client gets a 'retry exceeded' error when polling
>>>  > for a recently posted send, after some hours.
>>>
>>> There's a tradeoff between detecting real failures faster, and reducing
>>> false errors detected because a response came too slowly.
>>>
>>> Clearly if a response may take an amount of time 'X' to be received
>>> under normal conditions, there's no way to conclude that the remote side
>>> has failed without waiting at least 'X'.
>>>
>>>
>>
>> I understand. So there's really no difference between the two
>> situations, a real server failure or just a load problem that makes
>> things take longer?
>>
>
> From the sender QP's point of view, they are the same (no ACK/NAK
> arrived within a specific period of time).
>>
>> Something like a different error or a SIGPIPE :) ?
>>
>> I will describe my situation, maybe it helps (bear with me, as I'm
>> just starting with InfiniBand).
>> I have a client and a server. The client posts RDMA calls one at a
>> time (post, poll, post...), so the server is just there.
>> If I start something like 16 clients on 1 machine, after a few
>> hours I get an error in some client programs (retry exceeded) with
>> a timeout of 14. If I increase the timeout to 32, I don't see that
>> error, but if I stop the server, the clients take a long time to
>> notice, which is also not wanted.
>> That's why I asked if there is a 'good value'. With such a load
>> between 2 nodes, I always have to accept the risk that if the server
>> dies, the client will take a long time to see it. That's not nice!
>>
>
> Did you try to increase the retry_count too?
> (and not only the timeout).

But that wouldn't change my scenario, since the overall time is given
by timeout * retry count, right?
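
To put rough numbers on that product, here is a minimal sketch of the
arithmetic. It assumes the usual InfiniBand encoding of the QP timeout
attribute, where the local ACK timeout is 4.096 us * 2^timeout, the
field is 5 bits wide (so 31 is the largest usable value, and 0 means
"wait forever"), and retry_cnt is a 3-bit field (0..7). It also assumes
one initial attempt plus retry_cnt retries. The function names are
hypothetical, and real hardware may wait noticeably longer than the
nominal value, so treat the result as a lower-bound estimate only.

```python
# Rough estimate of how long a sender waits before reporting
# "retry exceeded". All names here are illustrative, not a verbs API.

BASE_US = 4.096  # base unit of the local ACK timeout, in microseconds


def ack_timeout_seconds(timeout: int) -> float:
    """Nominal time for one attempt: 4.096 us * 2**timeout."""
    if not 1 <= timeout <= 31:
        # Assumption: 0 means "infinite" and the field is 5 bits wide.
        raise ValueError("timeout must be in 1..31")
    return BASE_US * (2 ** timeout) / 1e6


def approx_detection_seconds(timeout: int, retry_cnt: int) -> float:
    """Nominal total wait: one initial attempt plus retry_cnt retries."""
    if not 0 <= retry_cnt <= 7:
        raise ValueError("retry_cnt must be in 0..7")
    return ack_timeout_seconds(timeout) * (retry_cnt + 1)


if __name__ == "__main__":
    for t in (14, 18, 20):
        total = approx_detection_seconds(t, retry_cnt=7)
        print(f"timeout={t:2d} retry_cnt=7 -> about {total:.2f} s")
```

Under these assumptions a timeout of 14 is about 67 ms per attempt, so
with retry_cnt = 7 the error would show up after roughly half a second.
Each increment of the timeout doubles that, while each extra retry only
adds one more attempt.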

> By the way, which RDMA operation do you execute: READ or WRITE?
>>
READ.

>> Thanks for the help and quick answers,
>>
>
> You are always welcome ..

Great :)
Cheers,

Rui