[openib-general] Stale CM callbacks
Sreevatsa Nagarajarao
vatsa at veritas.com
Mon Jan 15 18:11:31 PST 2007
Hi,
> If you are seeing any issues with stale connections, please let me know.
> It's possible that the cm is not handling things correctly.
It seems that when a node that was reset comes back up and tries to set up
a connection with a remote node, it may get a number of IB_CM_REP_ERROR
or IB_CM_REQ_ERROR events before establishing a successful connection.
At the same time, the remote node gets almost no errors. Is that because
the remote node would already have destroyed its QPs and explicitly called
ib_send_cm_dreq() when it determined that the node had gone down?
Also, in a multi-node cluster, the time it takes for a connection between
the reset node and any other remote node to succeed can vary considerably
because of the above errors. We have been experimenting with some of the
parameters (max_cm_retries, retry_count) to ib_send_cm_req() and
ib_send_cm_rep(), but without success.
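
To illustrate why the retry parameters matter for timing, here is a rough
sketch of how long the CM might keep retrying one message before reporting
an error. It assumes the usual IB timeout encoding (4.096 us * 2^exponent)
and that each of the max_cm_retries resends waits one full response
timeout. The helper names are made up for illustration and are not part of
the ib_cm API. As far as I understand, retry_count is the transport-level
retry count for the QP rather than a CM message retry count, so it is not
modeled here.

```c
#include <stdio.h>

/*
 * Hypothetical helper: convert an IB-encoded timeout exponent into
 * microseconds, using the 4.096 us * 2^exponent encoding.
 * The exponent is a 5-bit field, so values 0..31 are expected.
 */
static double ib_timeout_us(unsigned exponent)
{
    return 4.096 * (double)(1ULL << exponent);
}

/*
 * Hypothetical helper: rough upper bound on the time spent on a single
 * CM message (for example a REQ) before giving up. Assumes one initial
 * send plus max_cm_retries resends, each waiting one response timeout.
 * Packet lifetime and processing delays are ignored.
 */
static double cm_worst_case_us(unsigned timeout_exponent,
                               unsigned max_cm_retries)
{
    return ib_timeout_us(timeout_exponent) * (double)(max_cm_retries + 1);
}

/* Print the estimate in seconds for one parameter combination. */
static void print_estimate(unsigned timeout_exponent, unsigned max_cm_retries)
{
    printf("timeout exponent %u, max_cm_retries %u: about %.1f s\n",
           timeout_exponent, max_cm_retries,
           cm_worst_case_us(timeout_exponent, max_cm_retries) / 1e6);
}
```

Under these assumptions, a timeout exponent of 20 (about 4.3 s per attempt)
with max_cm_retries = 15 gives roughly 68.7 s before an error is reported
for one message, which alone could explain large variation in connect time.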
This behaviour is preventing the cluster ports from forming within the
stipulated time in our environment. We don't see these issues if we
reboot a node instead of resetting it.
Please let me know if you have any suggestions for us.
Thanks,
Sreevatsa
>
> - Sean