[ofa-general] timeout question

Fri May 16 11:40:09 PDT 2008

2008/5/16 Roland Dreier <rdreier at cisco.com>:
>  > hmm..... and is there no workaround for this, for this situation? I
>  > mean, if the server dies isn't there any possibility that
>  > the sender/client realizes this. If the timeout it's too large this
>  > can be cumbersome.
>  >
>  > I tried reducing the timeout and indeed the client realizes faster
>  > when the server exits but another problem arises: Without exiting the
>  > server,
>  > on the client side I get the error (retry exceed) when polling for a
>  > recently posted send - this after some hours.
>
> There's a tradeoff between detecting real failures faster, and reducing
> false errors detected because a response came too slowly.
>
> Clearly if a response may take an amount of time 'X' to be received
> under normal conditions, there's no way to conclude that the remote side
> has failed without waiting at least 'X'.
>

I understand. So there's really no difference between the two
situations, a real server failure or just a load problem that makes the
response take longer?
Isn't there something like a different error code or a SIGPIPE :) ?

I will describe my situation, maybe it helps (bear with me, as I'm
just starting with InfiniBand).
I have a client and a server. The client posts RDMA calls one at a
time (post, poll, post...), so the server just sits there.
If I start something like 16 clients on 1 machine, after a few
hours I get an error (retry exceeded) in some of the client programs
with a timeout of 14. If I increase the timeout to 32, I don't see that
error, but if I stop the server, the clients take a long time to
notice, which is also not wanted.
That's why I asked if there is a 'good value'. With such a load
between 2 nodes, I always have to accept the risk that if the server
dies the client will take a long time to see it. That's not nice!
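
To make the tradeoff concrete, here is a rough sketch of the arithmetic as I
understand it. It assumes the QP timeout value t means a local ACK timeout of
4.096 us * 2^t per attempt (with 0 meaning "wait forever"), and that the error
is only reported after the first attempt plus retry_cnt retries have all timed
out. The function names and the retry_cnt default of 7 are my own choices for
illustration, and the hardware may round the interval up, so these are
estimates only.

```python
# Sketch: map the QP "timeout" attribute to approximate wall-clock time.
# Assumption: local ACK timeout = 4.096 us * 2**timeout, 0 = infinite.

BASE_US = 4.096  # base unit in microseconds


def ack_timeout_seconds(timeout: int) -> float:
    """Time one attempt waits for an ACK before it is considered lost."""
    if timeout == 0:
        return float("inf")  # 0 disables the timer
    return BASE_US * (2 ** timeout) / 1e6


def detection_seconds(timeout: int, retry_cnt: int = 7) -> float:
    """Rough time until 'retry exceeded': first attempt plus retry_cnt retries."""
    return ack_timeout_seconds(timeout) * (retry_cnt + 1)


if __name__ == "__main__":
    # Print a small table for a few timeout values around the one I use (14).
    for t in (14, 16, 18, 20):
        print(f"timeout={t:2d}  per-attempt={ack_timeout_seconds(t):8.3f} s  "
              f"detect~{detection_seconds(t):8.3f} s")
```

Under these assumptions, each step up in the timeout doubles both the slack
for a slow response and the time needed to notice a dead server, which is
exactly the tradeoff I am stuck with.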

Thanks for the help and quick answers,

Rui