[ofa-general] RE: How fast to get RDMA_CM_EVENT_DISCONNECTED ?
Tang, Changqing
changquing.tang at hp.com
Thu Apr 12 07:21:31 PDT 2007
Roland:
Thanks for the suggestion. What is the minimum safe timeout value for a
typical IB network with 2-3 levels of switches?
--CQ
> -----Original Message-----
> From: Roland Dreier [mailto:rdreier at cisco.com]
> Sent: Wednesday, April 11, 2007 10:48 PM
> To: Tang, Changqing
> Cc: Sean Hefty; general at lists.openfabrics.org
> Subject: Re: [ofa-general] RE: How fast to get
> RDMA_CM_EVENT_DISCONNECTED ?
>
> > Yes. Internally in A, if the number of receives exceeds
> > lowwater (4), an ACK will be sent back. I assume the ACK is not
> > triggered at the moment.
> > When A is trying to receive a message from B and the message never
> > shows up, A actually sends a heartbeat back to B. However, it takes
> > several seconds for this heartbeat to complete with an error (we
> > configure timeout ~1 sec and retry count 7).
> >
> > Several seconds to detect a connection failure is not acceptable for
> > us, so if I use rdmacm, I want to know whether I can detect the
> > connection failure faster than with the heartbeat message.
>
> I think there is an internal contradiction in what you're doing here.
> If your (ACK timeout) * (retry count) exceeds the time that
> you consider acceptable to detect a failure, then you've set
> your connection up wrong. It's not even meaningful to talk
> about a connection failing faster than this amount of time --
> a connection will recover from a transient network failure
> that resolves itself before the last retry fails, and without
> a time machine it's impossible to say whether a network
> failure will or will not be resolved 7 seconds into the future.
>
> Certainly if you receive a disconnect request, then you know
> the remote side is really and truly gone. But if you've set
> your timeouts/retry counts so that connections will take 7
> seconds to fail after an event like a link going down, then
> there's no way to detect that failure before it occurs.
>
> It seems to me the solution is to reduce your timeout and/or
> retry count so that connections fail within the time scale
> that you require.
>
> - R.
>
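
The arithmetic in the quoted reply, (ACK timeout) * (retry count), can be
sketched roughly as below. This is only an illustration, not a
recommendation for any particular fabric. It assumes the standard IB
encoding of the QP local ACK timeout (4.096 usec * 2^timeout, with 0
meaning "wait forever") and takes the reply's simple product as the
estimate; real HCAs may take somewhat longer. The function names are made
up for this example, and the exponent 18 is just one value that gives the
"~1 sec" mentioned in the thread.

```python
import math

# Assumption: IB local ACK timeout is encoded as 4.096 usec * 2**timeout.
IB_ACK_BASE_USEC = 4.096


def ack_timeout_seconds(timeout_exp: int) -> float:
    """Convert the 5-bit QP timeout exponent into seconds."""
    if not 0 <= timeout_exp <= 31:
        raise ValueError("timeout exponent must be in 0..31")
    if timeout_exp == 0:
        return math.inf  # 0 means the QP waits indefinitely for an ACK
    return IB_ACK_BASE_USEC * (2 ** timeout_exp) / 1e6


def failure_detection_seconds(timeout_exp: int, retry_cnt: int) -> float:
    """Rough estimate from the thread: (ACK timeout) * (retry count)."""
    return ack_timeout_seconds(timeout_exp) * retry_cnt


def largest_exponent_within(budget_seconds: float, retry_cnt: int):
    """Largest timeout exponent whose estimate fits the budget, or None."""
    best = None
    for exp in range(1, 32):
        if failure_detection_seconds(exp, retry_cnt) <= budget_seconds:
            best = exp
    return best


if __name__ == "__main__":
    # Exponent 18 is about 1.07 sec; with retry count 7, about 7.5 sec.
    print(f"{ack_timeout_seconds(18):.3f} s per attempt")
    print(f"{failure_detection_seconds(18, 7):.3f} s until failure")
    # Largest exponent that keeps the estimate within a 1 sec budget.
    print(largest_exponent_within(1.0, 7))
```

Whether a smaller exponent is safe depends on the fabric (number of switch
hops, congestion), which this sketch does not model; it only shows how the
two settings bound the detection time.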