[ofa-general] RE: How fast to get RDMA_CM_EVENT_DISCONNECTED ?

Thu Apr 12 07:21:31 PDT 2007

Roland:

Thanks for the suggestion. What is the minimum safe timeout value for a
typical IB network with 2-3 levels of switches?

--CQ

> -----Original Message-----
> From: Roland Dreier [mailto:rdreier at cisco.com] 
> Sent: Wednesday, April 11, 2007 10:48 PM
> To: Tang, Changqing
> Cc: Sean Hefty; general at lists.openfabrics.org
> Subject: Re: [ofa-general] RE: How fast to get RDMA_CM_EVENT_DISCONNECTED ?
> 
>  > Yes. Internally in A, if the number of receives exceeds lowwater (4),
>  > an ACK will be sent back. I assume the ACK is not triggered at the
>  > moment. When A is trying to receive a message from B and the message
>  > never shows, A actually sends a heartbeat back to B; however, it
>  > takes several seconds for this heartbeat to complete with an error
>  > (we configure timeout ~1 sec and retry count 7).
>  >
>  > Several seconds to detect a connection failure is not acceptable
>  > for us, so if I use rdmacm, I want to know whether I can detect the
>  > connection failure faster than with the heartbeat message.
> 
> I think there is an internal contradiction in what you're doing here.
> If your (ACK timeout) * (retry count) exceeds the time that 
> you consider acceptable to detect a failure, then you've set 
> your connection up wrong.  It's not even meaningful to talk 
> about a connection failing faster than this amount of time -- 
> a connection will recover from a transient network failure 
> that resolves itself before the last retry fails, and without 
> a time machine it's impossible to say whether a network 
> failure will or will not be resolved 7 seconds into the future.
> 
> Certainly if you receive a disconnect request, then you know 
> the remote side is really and truly gone.  But if you've set 
> your timeouts/retry counts so that connections will take 7 
> seconds to fail after an event like a link going down, then 
> there's no way to detect that failure before it occurs.
> 
> It seems to me the solution is to reduce your timeout and/or 
> retry count so that connections fail within the time scale 
> that you require.
> 
>  - R.
>
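
The arithmetic behind the advice above can be sketched in a few lines of
Python. This is an illustration only, not code from the thread. It assumes
the usual InfiniBand encoding, where the nominal ACK timeout is
4.096 us * 2^exponent (the value placed in the `timeout` field of
`ibv_qp_attr`), and it assumes that failure is reported after the initial
attempt plus `retry_cnt` retries have each waited one nominal timeout. The
thread's rougher estimate is simply (ACK timeout) * (retry count). Real
hardware may wait longer than the nominal value, so treat the results as
estimates. The function names are hypothetical.

```python
# Rough failure-detection estimate for an IB reliable connection.
# Assumption: nominal ACK timeout = 4.096 us * 2**exponent.
# Assumption: an error is reported after (retry_count + 1) timeouts.

BASE_TIMEOUT_US = 4.096


def ack_timeout_seconds(exponent: int) -> float:
    """Nominal ACK timeout for a timeout exponent in the range 1..31."""
    if not 1 <= exponent <= 31:
        raise ValueError("exponent must be in 1..31 (0 disables the timeout)")
    return BASE_TIMEOUT_US * 1e-6 * 2 ** exponent


def detection_time_seconds(exponent: int, retry_count: int) -> float:
    """Estimated time until a dead peer produces a retry-exceeded error."""
    if not 0 <= retry_count <= 7:
        raise ValueError("retry_count must be in 0..7")
    return ack_timeout_seconds(exponent) * (retry_count + 1)


def largest_exponent_within(budget_seconds: float, retry_count: int):
    """Largest exponent whose estimated detection time fits the budget.

    Returns None when even the smallest exponent exceeds the budget.
    """
    for exponent in range(31, 0, -1):
        if detection_time_seconds(exponent, retry_count) <= budget_seconds:
            return exponent
    return None


if __name__ == "__main__":
    # Exponent 18 is the "~1 sec" timeout mentioned in the thread.
    print(f"timeout(18)      = {ack_timeout_seconds(18):.3f} s")
    print(f"detect(18, 7)    = {detection_time_seconds(18, 7):.3f} s")
    # Pick settings for a 1-second detection budget.
    print("exponent for 1 s, 7 retries:", largest_exponent_within(1.0, 7))
    print("exponent for 1 s, 3 retries:", largest_exponent_within(1.0, 3))
```

Under these assumptions, a timeout of about 1 second with 7 retries puts
failure detection at several seconds, which matches the behaviour described
above. Lowering either value shortens detection, at the cost of treating
shorter transient outages as connection failures.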