[openib-general] Re: CM stale connection collision.
Sean Hefty
mshefty at ichips.intel.com
Thu Jan 27 14:58:24 PST 2005
Libor Michalek wrote:
> I've been seeing some stale connection collisions, as a result of one
> of my test hosts being rebooted much more frequently than the other.
Tests to handle stale connections are not in the current code.
(Some portions of it are there but commented out, and the checks aren't
where they need to be.) My plan is to add this when adding in timewait
checking.
> Specifically, one of my nodes had two connections with the same remote
> communications ID and different local communications IDs. When the remote
> node received a DREQ from this node, a DREQ_RCVD was generated for the
> given local ID without checking to see if the remote ID matched, which
> it didn't. Since the remote node was back from a fresh reboot in both
> cases that generated the local ID, the local QPN was the same as well.
Currently the dreq_handler checks the DREQ:remote_comm_id and
remote_qpn. Since you have the same QPN, you're hitting this issue.
If the stale connection tests mentioned above were finished, this
second connection wouldn't have occurred.
> I think that all applicable messages should check both IDs.
This isn't overly difficult to add. My thinking on the CM
implementation was to treat the remote ID as opaque, so that the local
CM didn't need to make any assumptions about how the remote IDs were
assigned or used. I'll add in checks against the remote ID (and reject
if invalid).
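
As a rough illustration of the difference between the two checks, here is
a minimal sketch. The struct layouts and function names below are
hypothetical and simplified for this example; they are not the actual CM
data structures or the real dreq_handler code. The first function matches
only on the receiver's own ID and QPN, as described above; the second
also compares the sender's ID carried in the DREQ against the remote ID
stored for the connection.

```c
#include <stdint.h>

/* Hypothetical, simplified DREQ contents (not the real wire format). */
struct dreq_msg {
    uint32_t local_comm_id;   /* sender's ID for the connection */
    uint32_t remote_comm_id;  /* receiver's ID for the connection */
    uint32_t remote_qpn;      /* QPN the sender believes the receiver uses */
};

/* Hypothetical connection state kept by the receiving CM. */
struct conn {
    uint32_t local_comm_id;   /* ID this node assigned */
    uint32_t remote_comm_id;  /* ID the peer assigned */
    uint32_t local_qpn;       /* this node's QPN */
};

/* Match on the receiver's own ID and QPN only. */
static int dreq_matches_id_and_qpn(const struct conn *c,
                                   const struct dreq_msg *m)
{
    return c->local_comm_id == m->remote_comm_id &&
           c->local_qpn == m->remote_qpn;
}

/* Same check, plus the sender's ID must equal the stored remote ID. */
static int dreq_matches_both_ids(const struct conn *c,
                                 const struct dreq_msg *m)
{
    return dreq_matches_id_and_qpn(c, m) &&
           c->remote_comm_id == m->local_comm_id;
}
```

With a live connection on the rebooted node (local ID 1, remote ID 11)
and a DREQ for the stale connection (sender ID 10, receiver ID 1, same
QPN), the first check accepts the DREQ and the second does not. A caller
would send a reject when the second check fails.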