[Openib-windows] [openib-general] [Bug 214] New: IB Stack ASSERTS while handling stale connections.

Fabian Tillier ftillier at silverstorm.com
Tue Aug 29 13:52:28 PDT 2006


Hi Pankaj,

On 8/28/06, bugzilla-daemon at openib.org <bugzilla-daemon at openib.org> wrote:
>
> We are encountering a serious bug in the stack which happens while there is a
> stale connection in the list. Here is the call stack:

<snip>

> The problem seems to be that when in function __rep_handler the following line
> of code fails the check
>
> if( __insert_cep( p_cep ) != p_cep )
>
> This seems to mean we have something stale in the list. We call the function
> status = __process_stale( p_cep );
> which calls the function __process_rej.
> __process_rej then calls __remove_cep which tries to remove the p_cep from
> list.

Good catch.  When saving the REP information into the CEP, the remote
QPN and remote comm ID are set to non-zero values.  These values are
checked in __remove_cep.  The bug here is that the values should be
cleared if insertion fails.
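
To illustrate the idea, here is a minimal user-mode sketch, not the
actual code from revisions 466/467.  The cep_t layout, the single
fixed-size table standing in for the real maps, and the names
insert_cep, remove_cep and handle_rep are all hypothetical; only the
behavior (clear the remote QPN and comm ID when insertion fails, so the
removal path does not expect the CEP to be in the maps) follows the
description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified CEP: not the real kernel structure. */
typedef struct cep {
    uint32_t remote_qpn;      /* 0 means "not set" */
    uint32_t remote_comm_id;  /* 0 means "not set" */
    bool     in_map;          /* bookkeeping for this sketch only */
} cep_t;

#define MAP_SLOTS 8
cep_t *g_map[MAP_SLOTS];      /* stand-in for the real maps */

/* Returns p_cep if it was inserted, the conflicting entry if one
 * already exists, or NULL if the table is full. */
cep_t *insert_cep(cep_t *p_cep)
{
    size_t i, free_slot = MAP_SLOTS;

    for (i = 0; i < MAP_SLOTS; i++) {
        if (g_map[i] == NULL) {
            if (free_slot == MAP_SLOTS)
                free_slot = i;
        } else if (g_map[i]->remote_qpn == p_cep->remote_qpn ||
                   g_map[i]->remote_comm_id == p_cep->remote_comm_id) {
            return g_map[i];  /* stale or duplicate entry */
        }
    }
    if (free_slot == MAP_SLOTS)
        return NULL;
    g_map[free_slot] = p_cep;
    p_cep->in_map = true;
    return p_cep;
}

/* Non-zero remote identifiers imply the CEP must be in the map. */
void remove_cep(cep_t *p_cep)
{
    size_t i;

    if (p_cep->remote_qpn == 0 && p_cep->remote_comm_id == 0)
        return;               /* never inserted, nothing to remove */
    assert(p_cep->in_map);    /* the check that fired in the bug report */
    for (i = 0; i < MAP_SLOTS; i++)
        if (g_map[i] == p_cep)
            g_map[i] = NULL;
    p_cep->in_map = false;
    p_cep->remote_qpn = 0;
    p_cep->remote_comm_id = 0;
}

/* Save the REP identifiers, then undo them if insertion fails. */
bool handle_rep(cep_t *p_cep, uint32_t qpn, uint32_t comm_id)
{
    p_cep->remote_qpn = qpn;
    p_cep->remote_comm_id = comm_id;
    if (insert_cep(p_cep) != p_cep) {
        p_cep->remote_qpn = 0;      /* the fix: clear on failure */
        p_cep->remote_comm_id = 0;
        return false;
    }
    return true;
}
```

With the two clearing lines in handle_rep removed, calling remove_cep
on a CEP whose insertion failed would trip the assertion, which is the
symptom described in the bug report.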

I checked in a fix for this particular case in revision 466, but then
immediately realized that all cases of __insert_cep failing would need
to handle things properly.  I found a bug in the way __insert_cep
handles errors, and fixed that in revision 467.

> We think the problem is right here. This is the pointer to the new p_cep which
> was never inserted in the list because the check in __insert_cep function
> failed.
> Now instead of removing the old p_cep from the list, we are removing the new
> one. The cl_rbmap_remove_item function doesn't really validate the pointer given
> to it and always assumes the item was in the list.

Actually, the assertion is catching that the CEP is not in the map, as
it should.  We don't want to remove the old CEP here.  The problem is
just that the remote QPN and comm ID are non-zero, so __remove_cep
expects the CEP to be in the maps.

> This also raises the question of why an item was already present in the list.
> We are seeing this behavior when we try to make q-pairs to a target repeatedly
> i.e create a q-pair and then destroy it and then re-create it. It seems like if
> we recreate the q-pair within a few seconds (3) then the problem happens and
> if we wait for 5-10 seconds the problem seems to go away.

When you disconnect a QP, it should go into a timewait state.  The
time period for timewait depends on the packet lifetime and CA ACK
delay, but the intention is to make sure that all packets that might
still be on the wire for one connection have drained from the fabric
before the QP can be reused for another connection.

I suspect that the target is reusing the QP or connection identifier
too soon, violating the CM protocol.

> Is there a design limitation with the stack that a q-pair connection to the
> same target can not be made again within a certain time period? If yes what is
> the time period?

The time period is 2 x packet_lifetime + remote CA ack delay, rounded
up.  3 seconds sounds like a long time, though.
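
As a rough worked example of that formula (a sketch, not the stack's
code): it assumes both inputs use the usual InfiniBand exponent
encoding, where a value n means 4.096 us * 2^n, and it reads "rounded
up" as rounding the sum up to the next value representable in that
encoding.  The function names and the clamp of the exponent to 31 are
choices made for this illustration only.

```c
#include <stdint.h>

/* Decode an exponent-encoded time, assumed to mean 4.096 us * 2^exp.
 * Result is in nanoseconds.  The exponent is clamped to 31 here to
 * keep the arithmetic well inside 64 bits. */
uint64_t ib_time_ns(uint8_t exp)
{
    if (exp > 31)
        exp = 31;
    return 4096ULL << exp;
}

/* timewait = 2 x packet_lifetime + remote CA ack delay */
uint64_t timewait_ns(uint8_t pkt_life_exp, uint8_t remote_ack_delay_exp)
{
    return 2 * ib_time_ns(pkt_life_exp) + ib_time_ns(remote_ack_delay_exp);
}

/* Smallest exponent whose decoded time is >= ns (round up). */
uint8_t ib_time_exp_ceil(uint64_t ns)
{
    uint8_t exp = 0;

    while (exp < 31 && ib_time_ns(exp) < ns)
        exp++;
    return exp;
}
```

For instance, with both exponents equal to 14 (about 67 ms each), the
sum is about 201 ms, which rounds up to exponent 16 (about 268 ms).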

> If not, what should we be doing to ensure proper cleanup?

Does the target's CM enforce timewait?

> I guess even if there was a limitation there is still a bug here that the stack
> should be able to handle.

Yes, there most definitely was a bug.  Thanks for finding it and
reporting it.  The bug report was very good.

- Fab



