[ofa-general] Re: some questions on stale connection handling at the IB CM

Or Gerlitz ogerlitz at voltaire.com
Tue Dec 18 00:53:36 PST 2007


Sean Hefty wrote:
> IB_CM_REJ_STALE_CONN is sent in the following situations:
> * Remote ID in REQ matches a connection that is in timewait.  This is treated as
> a duplicate REQ that was processed after the connection had been terminated.
> * Remote QPN in REQ or REP matches an existing connection, and REQ/REP was not
> detected as a duplicate.
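
To check that I read the two cases correctly, here is a minimal sketch in C
of the decision they describe. It is a simplified model and not the actual CM
code: struct conn, classify_req() and the verdict names are hypothetical, the
lookup ignores which remote node the ID/QPN came from, and treating an ID
match outside timewait as a plain duplicate is my assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model for illustration; not the kernel CM's data structures. */
enum conn_state { CONN_ESTABLISHED, CONN_TIMEWAIT };

struct conn {
    uint32_t remote_id;    /* remote communication ID */
    uint32_t remote_qpn;   /* remote QP number */
    enum conn_state state;
};

enum req_verdict { REQ_ACCEPT, REQ_DUPLICATE, REQ_REJ_STALE_CONN };

/* Classify an incoming REQ against the table of known connections. */
static enum req_verdict classify_req(const struct conn *tbl, size_t n,
                                     uint32_t remote_id, uint32_t remote_qpn)
{
    size_t i;

    /* Case 1: the remote ID is already known. */
    for (i = 0; i < n; i++) {
        if (tbl[i].remote_id != remote_id)
            continue;
        /* In timewait: a late duplicate of a terminated connection. */
        if (tbl[i].state == CONN_TIMEWAIT)
            return REQ_REJ_STALE_CONN;
        /* Assumption: otherwise an ordinary duplicate REQ. */
        return REQ_DUPLICATE;
    }

    /* Case 2: not a duplicate, but the remote QPN is already in use. */
    for (i = 0; i < n; i++) {
        if (tbl[i].remote_qpn == remote_qpn)
            return REQ_REJ_STALE_CONN;
    }

    return REQ_ACCEPT;
}
```

The second loop is the case I ask about further down: a peer that keeps
presenting the same QPN lands in it on every attempt.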

OK, thanks for the clarification.

>> On the other side, when the CM receives a reject message with that reason, the
>> local handle (id) is moved to the timewait state, where my understanding is that
>> it will sit there for a while and then a reject/stale-connection callback will be
>> delivered to the user and the id will be removed.

> correct

I don't see what the user can do when the CM detects a remote QPN
match. If they continue to use the same QPN, this will repeat in an
endless loop, correct?

> This is missing.  But neither the DREQ nor the DREP generated in this case
> drives the state machines.  Both messages are simply generated and then consumed
> by the CM.  (I don't even think it's clear if the local and remote IDs in the
> DREQ/DREP are relative to the stale connection, or the new connection
> request/reply.)

I agree that it's quite unclear from the spec whether the IDs to be used
in the DREQ are those of the new connection or the stale one.
Specifically, those of the stale connection might no longer exist in the
CM that gets the DREQ, in which case it would just be dropped, so there's
no real gain in implementing this.

> Correct - keep-alive messages are still needed by apps to know if their
> connections are still valid.  IMO, stale connection detection becomes less
> useful as the number of systems being connected to increases.
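
For reference, the application-level keep-alive bookkeeping this implies can
be quite small. The sketch below is only an illustration under my own
assumptions: struct keepalive and its helpers are hypothetical names, the
caller supplies timestamps in milliseconds from a monotonic clock, and actually
sending and receiving the keep-alive messages is left to the application.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-connection keep-alive state kept by the application. */
struct keepalive {
    uint64_t last_rx_ms;   /* when we last heard from the peer */
    uint64_t interval_ms;  /* how often the peer is expected to send */
    unsigned max_missed;   /* missed intervals tolerated before giving up */
};

static void keepalive_init(struct keepalive *ka, uint64_t now_ms,
                           uint64_t interval_ms, unsigned max_missed)
{
    ka->last_rx_ms = now_ms;
    ka->interval_ms = interval_ms;
    ka->max_missed = max_missed;
}

/* Call whenever any message (keep-alive or data) arrives from the peer. */
static void keepalive_heard(struct keepalive *ka, uint64_t now_ms)
{
    ka->last_rx_ms = now_ms;
}

/* True once the peer has been silent for more than max_missed intervals. */
static bool keepalive_expired(const struct keepalive *ka, uint64_t now_ms)
{
    return now_ms - ka->last_rx_ms > ka->interval_ms * ka->max_missed;
}
```

An application would call keepalive_expired() from a periodic timer and
tear the connection down itself when it returns true.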

Is there anything the IB stack can do here to make application coding
simpler? In the past I suggested having the IB CM register for
InformInfo "GID out" notifications to catch remote ports going down.
Thinking about it again, though, when a port goes down an RC QP pair
does not, unless there was in-flight data, so a disconnect event
delivered by the CM might be a false alarm. This registration would
also put load on the SA, so it does not scale well, unless we make it a
CM feature that users enable on target nodes and not on initiators.

Or.



