[openib-general] 75 second timeout for endpoint state to go from Disconnect_Pending to Disconnected

mark kowalski mkowalski01 at gmail.com
Wed Mar 23 08:36:31 PST 2005


Hello,
    I've been doing some work with uDAPL, trying to recover from port
failures, and have run into a problem.  I have a simple test program
that contains a server and a client, running on two different
machines, sending data back and forth.  When I have a physical
connection problem on the client side (caused by pulling the IB cable
from the in-use port on the HCA), the server sees this and eventually
issues a graceful dat_ep_disconnect, then waits for the client to
reconnect to it.  The problem is that it takes about 75 seconds for
the endpoint on the server to go from the DISCONNECT_PENDING state to
DISCONNECTED.  The TS_UDAPL_CM_RESPONSE_TIMEOUT field specifies a
timeout of 4.x seconds, and it looks like it is being set up
correctly.  TS_UDAPL_MAX_CM_RETRIES is set to 15, so we suspect that
for some reason the disconnect request is being retried the maximum
number of times before it completes, and that this is why I'm seeing
a 75 second wait.
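
For reference, the arithmetic behind that guess can be sketched as
below.  This is only a rough model, not the actual CM logic: it
assumes one initial request plus max_retries retries, each waiting
the full response timeout.  The function name is made up for
illustration, and since the exact timeout is only known as 4.x
seconds, any value passed in is a bound rather than a measurement.

```c
/* Rough model (an assumption, not the real CM code): the request is
 * sent once and then retried max_retries times, and every attempt
 * waits the full response timeout before giving up. */
double worst_case_wait_s(double timeout_s, unsigned max_retries)
{
    unsigned attempts = max_retries + 1u;  /* initial send + retries */
    return timeout_s * (double)attempts;
}
```

With 15 retries, a timeout between 4.0 and 5.0 seconds gives a range
of 64 to 80 seconds under this model, which brackets the observed 75
seconds.  With 2 retries the same model gives 12 to 15 seconds.
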
     We have tried changing TS_UDAPL_MAX_CM_RETRIES in
dapl_openib_cm.h from 15 to 2 to see if this would make it disconnect
faster, but when we used a CATC tool to examine the packets as they
went across the wire, we found that 15 was still being passed as the
max retry count.  A side issue to this problem: how can you change
the retry and timeout values and have the change accepted?
     Changing the disconnect to ABRUPT doesn't help: even though the
endpoint status is immediately reported as DISCONNECTED, the
cr_accept fails when the server tries to accept the reconnection
request from the client.  As long as the server waits until the
status of the endpoint changes from DISCONNECT_PENDING to
DISCONNECTED before processing the client connect request, the
connection can be reestablished and data transmission restarted.
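
The wait described above amounts to something like the sketch below.
It is not the test program's actual code: the enum, the callback
type, and the fake endpoint are hypothetical stand-ins so the example
compiles on its own.  In a real uDAPL program the callback would wrap
whatever endpoint status query the provider offers.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-ins for the endpoint states named in the message. */
typedef enum { EP_CONNECTED, EP_DISCONNECT_PENDING, EP_DISCONNECTED } ep_state_t;

/* Hypothetical callback that reports the current endpoint state. */
typedef ep_state_t (*ep_state_fn)(void *ctx);

/* Poll until the endpoint reports DISCONNECTED or max_polls is used up.
 * Returns true only if DISCONNECTED was observed. */
bool wait_for_disconnected(ep_state_fn get_state, void *ctx,
                           unsigned poll_ms, unsigned max_polls)
{
    struct timespec delay = { (time_t)(poll_ms / 1000u),
                              (long)(poll_ms % 1000u) * 1000000L };
    for (unsigned i = 0; i < max_polls; i++) {
        if (get_state(ctx) == EP_DISCONNECTED)
            return true;
        nanosleep(&delay, NULL);  /* back off before the next check */
    }
    return get_state(ctx) == EP_DISCONNECTED;  /* one final check */
}

/* Fake endpoint for demonstration: stays DISCONNECT_PENDING for a
 * fixed number of queries, then reports DISCONNECTED. */
struct fake_ep { unsigned pending_polls; };

ep_state_t fake_ep_state(void *ctx)
{
    struct fake_ep *ep = ctx;
    if (ep->pending_polls == 0)
        return EP_DISCONNECTED;
    ep->pending_polls--;
    return EP_DISCONNECT_PENDING;
}
```

The server would process the client's connect request only after
wait_for_disconnected returns true.  The poll interval and poll count
are arbitrary knobs here; with a 75 second transition they would need
to cover at least that long.
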
    Does anyone know why it takes so long for the server endpoint to
disconnect, or why the retry count change did not seem to be
accepted?

Thanks in advance for any help,
Mark Kowalski


