[openib-general] Stale CM callbacks

Sreevatsa Nagarajarao vatsa at veritas.com
Mon Jan 15 18:11:31 PST 2007


Hi,

> If you are seeing any issues with stale connections, please let me know.
> It's possible that the cm is not handling things correctly.

It seems that when the reset node comes back up and tries to set up a
connection with a remote node, it may get a number of IB_CM_REP_ERROR
or IB_CM_REQ_ERROR events before establishing a successful connection.
At the same time, the remote node gets almost no errors. Is that because
the remote node would already have destroyed the qpairs and explicitly
called ib_send_cm_dreq() when it determined that the node had gone down?

Also, in a multi-node cluster, the time at which a connect between the
reset node and any other remote node succeeds can vary considerably
because of the above errors. We have been experimenting with some of the
parameters (max_cm_retries, retry_count) to ib_send_cm_req() and
ib_send_cm_rep(), but without success.
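
To make the time budget concrete, here is a rough sketch of how the
retry setting bounds the wait for a single connect attempt. It assumes
the usual InfiniBand timeout encoding (4.096 us * 2^value) and assumes
that a request is sent once plus up to max_cm_retries more times, each
attempt waiting one full response timeout. The helper names and the
example timeout value are hypothetical and are not part of the ib_cm API;
the real per-attempt timeout may also include packet lifetime.

```c
#include <stdio.h>

/*
 * Convert an IB-encoded timeout exponent to milliseconds.
 * Assumption: the standard encoding of 4.096 us * 2^value.
 */
static double ib_timeout_to_ms(unsigned int value)
{
    return 4.096e-3 * (double)(1ULL << value);
}

/*
 * Rough upper bound on how long one connect attempt can take before
 * the CM gives up: the initial send plus max_cm_retries resends, each
 * assumed to wait one full response timeout (hypothetical model).
 */
static double worst_case_wait_ms(unsigned int timeout_value,
                                 unsigned int max_cm_retries)
{
    return (double)(max_cm_retries + 1) * ib_timeout_to_ms(timeout_value);
}

/* Print the estimated bound for one hypothetical parameter pair. */
static void print_budget(unsigned int timeout_value,
                         unsigned int max_cm_retries)
{
    printf("timeout=%u retries=%u -> up to %.1f ms\n",
           timeout_value, max_cm_retries,
           worst_case_wait_ms(timeout_value, max_cm_retries));
}
```

Calling print_budget() with a few candidate values shows how quickly the
bound grows: under this model each extra retry adds one full timeout, and
each increment of the timeout exponent doubles the total.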

This behaviour is preventing the cluster ports from forming within the
stipulated time in our environment. We don't see these issues if we
reboot a node instead of resetting it.

Please let me know if you have any suggestions for us.

Thanks,
Sreevatsa

> 
> - Sean
