[ofa-general] timeout question

Rui Machado ruimario at gmail.com
Fri May 16 11:10:56 PDT 2008


Hi,

>>
>> when setting the timeout in a struct ibv_qp_attr, this value
>> corresponds to the Local ACK Timeout, which according to the InfiniBand
>> spec defines the transport timer timeout through the formula
>> "4.096 us * 2^(Local ACK Timeout)". Is this right?
>> And is there a value for this timeout to be considered "good practice"?
>>
> This value depends on your fabric size, on the HCA you have (and some other factors).
>> Also, in a client-server setup, if this timeout is set to a "big
>> value" (like 30) when the server dies, the client will take that
>> amount of time to realize the failure. Is this correct?
>>
> Yes, after at least (the calculated time * retry_count) usec, the sender QP will get a "retry exceeded" error
> (if an SR was posted and no response came back from the receiver).
>
Hmm... and is there no workaround for this situation? I mean, if the
server dies, isn't there any way for the sender/client to notice?
If the timeout is too large, this can be cumbersome.

I tried reducing the timeout, and indeed the client notices faster
when the server exits, but another problem arises: even though the
server has not exited, after some hours the client gets the error
(retry exceeded) when polling for a recently posted send.
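
To put numbers on the timeouts discussed above, here is a minimal sketch of the arithmetic in C. It assumes the formula quoted in this thread (4.096 us * 2^timeout) and the lower bound of timeout * retry_count given in the reply; the function names are hypothetical helpers for illustration, not part of libibverbs. The timeout field is 5 bits wide (0..31), and a value of 0 means the timer is disabled, which is reported here as 0.

```c
#include <stdint.h>

/*
 * Local ACK Timeout in nanoseconds: 4.096 us * 2^timeout.
 * 4.096 us is 4096 ns, so the result is 4096 shifted left by 'timeout'.
 * Returns 0 for timeout == 0 (timer disabled, i.e. wait forever)
 * and for values outside the 5-bit range.
 */
uint64_t local_ack_timeout_ns(unsigned timeout)
{
    if (timeout == 0 || timeout > 31)
        return 0;
    return (uint64_t)4096 << timeout;
}

/*
 * Lower bound on the time before the sender sees "retry exceeded",
 * following the estimate given in this thread: timeout * retry_cnt.
 * The real delay may be longer, depending on the HCA.
 */
uint64_t min_failure_detect_ns(unsigned timeout, unsigned retry_cnt)
{
    return local_ack_timeout_ns(timeout) * retry_cnt;
}
```

With these helpers, timeout = 14 gives about 67 ms per attempt, while timeout = 30 gives about 4398 s (roughly 73 minutes) per attempt, so with retry_cnt = 7 the client would wait at least 8.5 hours before noticing a dead server.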

Thank you for the help.


Rui


