SPAM Re: [ofa-general] timeout question

Fri May 16 18:15:03 PDT 2008

Rui Machado wrote:
> Hi,
>
>   
>>> When setting the timeout in a struct ibv_qp_attr, this value
>>> corresponds to the Local ACK Timeout, which according to the InfiniBand
>>> spec defines the transport timer timeout by the formula
>>> "4.096 usec * 2^(Local ACK Timeout)". Is this right?
>>> And is there a value for this timeout to be considered "good practice"?
>>>
>>>       
>> This value depends on your fabric size, on the HCA you have, and on some other factors.
>>     
>>> Also, in a client-server setup, if this timeout is set to a "big
>>> value" (like 30) when the server dies, the client will take that
>>> amount of time to realize the failure. Is this correct?
>>>
>>>       
>> Yes, after (at least) the calculated time multiplied by the retry count, in usec, the sender QP will get a
>> retry exceeded error (if a SR was posted and no response came back from the receiver).
>>
>>     
> Hmm... and is there no workaround for this situation? I mean, if the
> server dies, isn't there any way for the sender/client to notice
> sooner? If the timeout is too large this can be cumbersome.
>
> I tried reducing the timeout, and indeed the client notices faster
> when the server exits, but another problem arises: even though the
> server has not exited, after some hours the client side gets the
> error (retry exceeded) when polling for a recently posted send.
>   
You don't really need to set a timeout of hours. I believe that a few
seconds should be enough for almost any of today's clusters...
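
To put rough numbers on the formula quoted above, here is a minimal sketch in C.
It only does the arithmetic and does not call libibverbs. The helper names
local_ack_timeout_usec() and min_detect_time_sec() are hypothetical, and the
"timeout * retry count" lower bound is taken from the reply above; the exact
timer behaviour depends on the HCA. Treating 0 as "no timeout" is an assumption
based on the usual meaning of a zero Local ACK Timeout (wait indefinitely).

```c
#include <stdint.h>

/* Hypothetical helper: Local ACK Timeout in microseconds,
 * computed as 4.096 usec * 2^timeout.
 * 'timeout' is the 5-bit value stored in ibv_qp_attr.timeout (1..31).
 * Assumption: 0 means "wait indefinitely", reported here as 0.0. */
static double local_ack_timeout_usec(uint8_t timeout)
{
    if (timeout == 0 || timeout > 31)
        return 0.0;
    return 4.096 * (double)(1ULL << timeout);
}

/* Hypothetical helper: lower bound, in seconds, on the time before the
 * sender QP reports "retry exceeded", using the thread's estimate of
 * (calculated timeout) * (retry count). Real hardware may take longer. */
static double min_detect_time_sec(uint8_t timeout, uint8_t retry_cnt)
{
    return local_ack_timeout_usec(timeout) * (double)retry_cnt / 1e6;
}
```

With these helpers, a timeout value of 20 gives about 4.3 seconds per attempt
and about 30 seconds with a retry count of 7, while a value of 30 gives roughly
73 minutes per attempt, which is why a "big value" like 30 leads to delays of
hours.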


> Thank you for the help.
>   
You are welcome
:)

Dotan
