[ofa-general] timeout question
Dotan Barak
dotanba at gmail.com
Fri May 16 18:54:54 PDT 2008
Rui Machado wrote:
> 2008/5/16 Roland Dreier <rdreier at cisco.com>:
>
>> > Hmm... and is there no workaround for this situation? I
>> > mean, if the server dies, isn't there any possibility that
>> > the sender/client realizes this? If the timeout is too large this
>> > can be cumbersome.
>> >
>> > I tried reducing the timeout and indeed the client realizes faster
>> > when the server exits but another problem arises: Without exiting the
>> > server,
>> > on the client side I get the error (retry exceed) when polling for a
>> > recently posted send - this after some hours.
>>
>> There's a tradeoff between detecting real failures faster, and reducing
>> false errors detected because a response came too slowly.
>>
>> Clearly if a response may take an amount of time 'X' to be received
>> under normal conditions, there's no way to conclude that the remote side
>> has failed without waiting at least 'X'.
>>
>>
>
> I understand. So there's no real difference between the two
> situations, a real server failure or just a load problem that takes
> more time?
>
From the sender QP's point of view, they are the same: an ACK/NACK
wasn't received within a specific period of time.
> Something like a different error or a SIGPIPE :) ?
>
> I will describe my situation, maybe it helps (bear with me as I'm
> just starting with InfiniBand and so on).
> I have a client and a server. The client posts RDMA calls one at a
> time (post, poll, post...), so the server is just there.
> If I try to start something like 16 clients on 1 machine, after a few
> hours I will get an error on some client programs (retry excess) with
> a timeout of 14. If I increase the timeout to 32, I don't see that
> error but if I stop the server, the clients take a lot of time to
> acknowledge that, which is also not wanted.
> That's why I asked if there is a 'good value'. With such a load
> between 2 nodes, I always have to accept the risk that if the server
> dies the client will take a long time to see it. That's not nice!
>
Did you try to increase the retry_count too
(and not only the timeout)?
By the way, which RDMA operation do you execute: READ or WRITE?
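
To make the timeout/retry_count tradeoff concrete, here is a rough sketch
of the arithmetic. It relies on the InfiniBand definition of the Local ACK
Timeout (4.096 usec * 2^timeout, a 5-bit field where 0 means "wait
forever") and on retry_cnt being a 3-bit field (0..7). The function names
are made up for this illustration, and the assumption that the QP waits
exactly one full timeout for the first transmission and for each retry is
a simplification: real HCAs may wait longer than the configured value, so
treat the result as an estimate only.

```python
# Sketch only: estimates how long an RC QP may wait before a work request
# completes with "retry exceeded". Helper names are hypothetical.

BASE_USEC = 4.096  # base unit of the InfiniBand Local ACK Timeout


def ack_timeout_usec(timeout):
    """Time one transmission waits for an ACK/NACK, in microseconds."""
    if not 0 <= timeout <= 31:
        raise ValueError("timeout is a 5-bit field (0..31)")
    if timeout == 0:
        return float("inf")  # 0 means the QP waits indefinitely
    return BASE_USEC * (2 ** timeout)


def detection_time_usec(timeout, retry_cnt):
    """Estimated time until the error is reported, in microseconds.

    Assumption: one initial transmission plus retry_cnt retries, each
    waiting one full ACK timeout. Hardware may round the timeout up.
    """
    if not 0 <= retry_cnt <= 7:
        raise ValueError("retry_cnt is a 3-bit field (0..7)")
    return ack_timeout_usec(timeout) * (retry_cnt + 1)


if __name__ == "__main__":
    for timeout in (14, 18, 20):
        for retry_cnt in (1, 7):
            seconds = detection_time_usec(timeout, retry_cnt) / 1e6
            print(f"timeout={timeout} retry_cnt={retry_cnt}: ~{seconds:.2f} s")
```

Under these assumptions, timeout = 14 gives about 67 ms per attempt, and
with retry_cnt = 7 a dead server would be noticed after roughly 0.54
seconds. Each increment of timeout doubles both the tolerance for slow
responses and the detection time, while retry_cnt scales them linearly.
Note that a value of 32 does not fit in the 5-bit field; the sketch
rejects it rather than guessing how a particular stack would treat it.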
> Thanks for the help and quick answers,
>
You are always welcome.
Dotan