[ofa-general] RE: How fast to get RDMA_CM_EVENT_DISCONNECTED ?

Roland Dreier rdreier at cisco.com
Wed Apr 11 20:48:17 PDT 2007


 > Yes, internally in A, if the # of receives exceeds lowwater(4), an ack
 > will be sent back. I assume the ACK is not triggered at the moment.
 > When A is trying to receive a message from B, and the message never
 > shows, A actually sends a heartbeat back to B; however, it takes
 > several seconds for this heartbeat to complete with an error (we
 > configure timeout ~1 sec, and retry count 7).
 > 
 > Several seconds to detect a connection failure is not acceptable for us,
 > so if I use rdmacm, I want to know whether I can detect the connection
 > failure faster than the heartbeat message.

I think there is an internal contradiction in what you're doing here.
If your (ACK timeout) * (retry count) exceeds the time that you
consider acceptable to detect a failure, then you've set your
connection up wrong.  It's not even meaningful to talk about a
connection failing faster than this amount of time -- a connection
will recover from a transient network failure that resolves itself
before the last retry fails, and without a time machine it's
impossible to say whether a network failure will or will not be
resolved 7 seconds into the future.

Certainly if you receive a disconnect request, then you know the
remote side is really and truly gone.  But if you've set your
timeouts/retry counts so that connections will take 7 seconds to
fail after an event like a link going down, then there's no way to
detect that failure before it occurs.

It seems to me the solution is to reduce your timeout and/or retry
count so that connections fail within the time scale that you require.
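
To make the arithmetic concrete, here is a rough sketch of the
(ACK timeout) * (retry count) estimate. It assumes the InfiniBand
encoding of the local ACK timeout, 4.096 usec * 2^exponent, and the
function names are made up for illustration. Real hardware may take
somewhat longer than this simple product suggests, so treat the result
as an estimate rather than a guarantee.

```python
# Rough estimate of how long a reliable connection takes to report failure.
# Assumption: the local ACK timeout is encoded as 4.096 us * 2**exponent
# (exponent 0 means "no timeout" and is rejected here).
# All function names below are hypothetical helpers, not a library API.

ACK_TIMEOUT_BASE_SECONDS = 4.096e-6


def ack_timeout_seconds(timeout_exponent: int) -> float:
    """Convert the encoded timeout exponent (1..31) to seconds."""
    if not 1 <= timeout_exponent <= 31:
        raise ValueError("timeout exponent must be in 1..31")
    return ACK_TIMEOUT_BASE_SECONDS * (2 ** timeout_exponent)


def detection_estimate_seconds(timeout_exponent: int, retry_count: int) -> float:
    """Simple (ACK timeout) * (retry count) product."""
    if not 1 <= retry_count <= 7:
        raise ValueError("retry count must be in 1..7 for this sketch")
    return ack_timeout_seconds(timeout_exponent) * retry_count


def largest_exponent_within(budget_seconds: float, retry_count: int):
    """Largest exponent whose estimate fits the budget, or None if none fits."""
    for exponent in range(31, 0, -1):
        if detection_estimate_seconds(exponent, retry_count) <= budget_seconds:
            return exponent
    return None


if __name__ == "__main__":
    # Exponent 18 is roughly the "~1 sec" timeout mentioned in the thread.
    print(detection_estimate_seconds(18, 7))
    # Pick a timeout so that 7 retries fit in a 1 second budget.
    print(largest_exponent_within(1.0, 7))
```

With an exponent of 18 (about 1.07 sec per try) and 7 retries, the
estimate comes out to roughly 7.5 sec, which matches the several
seconds observed above. Fitting the same 7 retries into a 1 sec budget
would need an exponent of 15 or lower under these assumptions.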

 - R.


