[ofa-general] timeout question

Rui Machado ruimario at gmail.com
Fri May 16 11:40:09 PDT 2008


2008/5/16 Roland Dreier <rdreier at cisco.com>:
>  > hmm..... and is there no workaround for this, for this situation? I
>  > mean, if the server dies isn't there any possibility that
>  > the sender/client realizes this. If the timeout it's too large this
>  > can be cumbersome.
>  >
>  > I tried reducing the timeout and indeed the client realizes faster
>  > when the server exits but another problem arises: Without exiting the
>  > server,
>  > on the client side I get the error (retry exceed) when polling for a
>  > recently posted send - this after some hours.
>
> There's a tradeoff between detecting real failures faster, and reducing
> false errors detected because a response came too slowly.
>
> Clearly if a response may take an amount of time 'X' to be received
> under normal conditions, there's no way to conclude that the remote side
> has failed without waiting at least 'X'.
>

I understand. So there's no real difference between the two
situations, a real server failure or just a load problem that makes
things take longer?
Something like a different error or a SIGPIPE :) ?

I will describe my situation, maybe it helps (bear with me, as I'm
just starting with InfiniBand and so on).
I have a client and a server. The client posts RDMA calls one at a
time (post, poll, post...), so the server is just there.
If I start something like 16 clients on 1 machine, after a few
hours I get an error on some client programs (retry exceeded) with
a timeout of 14. If I increase the timeout to 32, I don't see that
error, but if I stop the server the clients take a long time to
notice, which is also not wanted.
That's why I asked if there is a 'good value'. With such a load
between 2 nodes, I always have to accept the risk that if the server
dies the client will take a long time to see it. That's not nice!
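
For reference, here is a rough sketch of how the QP timeout value maps
to wall-clock time. It assumes the InfiniBand spec encoding of the local
ACK timeout (4.096 us * 2^timeout, a 5-bit field where 0 means wait
forever) and a 3-bit retry count. The function names are made up for
illustration, and the detection estimate is only a lower bound, since
the HCA is allowed to wait longer than the nominal timeout.

```python
# Sketch only: converts the QP "timeout" attribute into seconds, assuming
# the InfiniBand encoding 4.096 us * 2**timeout (0 = infinite).
# Function names are hypothetical, not part of libibverbs.
BASE_US = 4.096


def ack_timeout_seconds(timeout: int) -> float:
    """Nominal local ACK timeout in seconds for a 5-bit timeout value."""
    if not 0 <= timeout <= 31:
        raise ValueError("timeout is a 5-bit field (0..31)")
    if timeout == 0:
        return float("inf")  # 0 means the QP never times out
    return BASE_US * (2 ** timeout) / 1e6


def min_detection_seconds(timeout: int, retry_cnt: int) -> float:
    """Lower bound on the time before a 'retry exceeded' completion:
    the first attempt plus retry_cnt retries, one timeout each."""
    if not 0 <= retry_cnt <= 7:
        raise ValueError("retry_cnt is a 3-bit field (0..7)")
    return ack_timeout_seconds(timeout) * (retry_cnt + 1)


if __name__ == "__main__":
    for t in (14, 18, 22, 31):
        print(f"timeout={t:2d}: {ack_timeout_seconds(t):12.3f} s per attempt, "
              f">= {min_detection_seconds(t, 7):12.3f} s with retry_cnt=7")
```

Each step of the timeout value doubles the wait, so the choice is coarse:
a value that tolerates a slow response under load also doubles (or more)
the time needed to notice a dead server.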

Thanks for the help and quick answers,

Rui


