[openib-general] 75 second timeout for endpoint state to go from Disconnect_Pending to Disconnected
mark kowalski
mkowalski01 at gmail.com
Wed Mar 23 08:36:31 PST 2005
Hello,
I've been working with uDAPL, trying to recover from port failures,
and have run into a problem. I have a simple test program consisting
of a server and a client, running on two different machines, sending
data back and forth. When there is a physical connection problem on
the client side (caused by pulling the IB cable from the in-use port
on the HCA), the server detects it, eventually issues a graceful
dat_ep_disconnect, and then waits for the client to reconnect. The
problem is that the endpoint on the server takes about 75 seconds to
go from the DISCONNECT_PENDING state to DISCONNECTED.
TS_UDAPL_CM_RESPONSE_TIMEOUT specifies a timeout of 4.x seconds, and
it looks like it is being set up correctly. TS_UDAPL_MAX_CM_RETRIES
is set to 15, so we suspect that the disconnect request is being
retried the maximum number of times before it completes, and that
this is why I'm seeing a 75 second wait.
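
As a rough check of that hypothesis, the arithmetic can be sketched as
below. This is only a model, not uDAPL code: it assumes every attempt
waits out the full response timeout before the next retry is sent, the
4.0 to 5.0 second values are stand-ins for the "4.x seconds" above, and
whether the initial send counts as an extra attempt is an assumption
that has not been checked against the CM code. The function name
worst_case_wait is hypothetical.

```python
# Rough model only: assumes each unanswered attempt costs one full
# response timeout before the next retry goes out.

def worst_case_wait(timeout_s, max_retries, count_initial_send=False):
    """Total seconds spent if no attempt is ever answered."""
    attempts = max_retries + (1 if count_initial_send else 0)
    return timeout_s * attempts

OBSERVED_WAIT_S = 75.0   # delay reported in this post
MAX_RETRIES = 15         # TS_UDAPL_MAX_CM_RETRIES as shipped

# Per-attempt timeout implied by the observed delay.
implied_timeout_s = OBSERVED_WAIT_S / MAX_RETRIES

for timeout_s in (4.0, 4.5, 5.0):   # stand-ins for "4.x seconds"
    print(timeout_s,
          worst_case_wait(timeout_s, MAX_RETRIES),
          worst_case_wait(timeout_s, MAX_RETRIES, count_initial_send=True))
print("implied per-attempt timeout:", implied_timeout_s)
```

Under this model, 15 attempts at about 5 seconds each, or 16 attempts
at about 4.7 seconds each, add up to 75 seconds, so the observed delay
is at least consistent with the retry count being exhausted.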
We tried changing TS_UDAPL_MAX_CM_RETRIES in dapl_openib_cm.h from 15
to 2 to see whether the disconnect would complete faster, but when we
used a CATC tool to examine the packets on the wire, 15 was still
being passed as the max retry count. So a side question is: how can
you change the retry and timeout values and have them take effect?
Changing the disconnect to ABRUPT doesn't help: even though the
endpoint status is immediately reported as DISCONNECTED, the cr_accept
fails when the server tries to accept the reconnection request from
the client. In the graceful case, as long as the server waits until
the endpoint status changes from DISCONNECT_PENDING to DISCONNECTED
before processing the client's connect request, the connection can be
reestablished and data transmission restarted.
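
The wait-then-accept pattern described above can be sketched as a
bounded polling loop. This is a generic illustration, not the uDAPL
API: get_state is a hypothetical callable standing in for whatever the
server uses to read the endpoint state, and the state names are plain
strings chosen to match the ones in this post.

```python
import time

def wait_for_state(get_state, wanted, deadline_s, poll_interval_s=0.1,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll get_state() until it returns `wanted` or the deadline passes.

    Returns True if the wanted state was seen, False on timeout.
    """
    start = clock()
    while True:
        if get_state() == wanted:
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(poll_interval_s)

# Stand-in endpoint: reports DISCONNECT_PENDING three times, then
# DISCONNECTED. A real server would query its endpoint here instead.
states = iter(["DISCONNECT_PENDING"] * 3 + ["DISCONNECTED"])
ready = wait_for_state(lambda: next(states), "DISCONNECTED",
                       deadline_s=5.0, poll_interval_s=0.01)
if ready:
    print("endpoint disconnected; safe to process the connect request")
else:
    print("timed out waiting for DISCONNECTED")
```

The deadline keeps the server from waiting indefinitely if the
endpoint never leaves DISCONNECT_PENDING; it would need to be longer
than the roughly 75 seconds currently observed.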
Does anyone know why it is taking so long for the server endpoint to
disconnect, or why the retry count change did not seem to take
effect?
Thanks in advance for any help,
Mark Kowalski