[Openib-windows] Wrong allocation of mads in al_mad_pool.c (user mode) line 800

Fabian Tillier ftillier at silverstorm.com
Wed Apr 5 13:00:50 PDT 2006


Hi Tzachi,

On 4/5/06, Tzachi Dar <tzachid at mellanox.co.il> wrote:
> OK,
>
> So here is another question:
> When running SDP code with many connections simultaneously, I sometimes
> get an assert with the following call stack:
>
> ChildEBP RetAddr  Args to Child
> f78ae758 8086b5d0 80889f00 00000003 f7717000 nt!DbgBreakPoint
> f78aea48 8086b6f8 baaaf050 baaaf020 0000021c nt!RtlAssert2+0x104
> f78aea64 baaaf234 baaaf050 baaaf020 0000021c nt!RtlAssert+0x18
> f78aea88 baaaecb1 8a834970 89c23a30 89ed5e18 ibal!__reject_mad+0x164
> [q:\projinf1\trunk\core\al\kernel\al_cm_cep.c @ 540]
> f78aead4 baaa93e5 8a834970 89ed5e18 8a6b72a8 ibal!__process_rep+0x531
> [q:\projinf1\trunk\core\al\kernel\al_cm_cep.c @ 1348]
> f78aeb00 baa78465 8a81a168 ffffffff 8a834970 ibal!__cep_mad_recv_cb+0x1e5
> [q:\projinf1\trunk\core\al\kernel\al_cm_cep.c @ 1885]
>
> In my case the function is called with p_cep->state ==
> CEP_STATE_REQ_SENT
> and the reason is IB_REJ_STALE_CONN.
>
> Actually, looking at the two functions __process_rep and __reject_mad,
> it seems that every time the insert in __process_rep fails
> (that is, if( __insert_cep( p_cep ) != p_cep )), we will reach an assert.
>
> Can you tell what the problem here is?

The problem is that the code doesn't handle a stale connection detected
from a REP.  The __reject_mad function needs to handle the CEP being in
CEP_STATE_REQ_SENT, but it doesn't, so you hit the assert (which is
there to trap cases that aren't handled!).
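
To make the shape of the problem concrete, here is a rough sketch of a
state switch whose default case traps unhandled states, with the
REQ_SENT case added.  All of the names below (sketch_cep_state_t,
sketch_rejected_msg, and so on) are made up for illustration; this is
not the actual al_cm_cep.c code, and the set of states it covers is an
assumption.

```c
#include <assert.h>

/* Hypothetical stand-ins for the CEP states; not the real definitions. */
typedef enum {
    SKETCH_CEP_REQ_RCVD,
    SKETCH_CEP_REQ_SENT,
    SKETCH_CEP_ESTABLISHED
} sketch_cep_state_t;

/* Which received message a reject would refer to (illustrative only). */
typedef enum {
    SKETCH_REJECTS_REQ,
    SKETCH_REJECTS_REP,
    SKETCH_UNHANDLED
} sketch_rejected_t;

/* Map the CEP state to the message being rejected. */
static sketch_rejected_t sketch_rejected_msg(sketch_cep_state_t state)
{
    switch (state) {
    case SKETCH_CEP_REQ_RCVD:
        /* We received a REQ and are rejecting it. */
        return SKETCH_REJECTS_REQ;
    case SKETCH_CEP_REQ_SENT:
        /* The added case: we sent a REQ, got a REP back, and are
         * rejecting that REP (for example as a stale connection). */
        return SKETCH_REJECTS_REP;
    default:
        /* No handler for this state. */
        return SKETCH_UNHANDLED;
    }
}

/* Reject path: asserts on any state that has no handler, the way the
 * assert in the quoted stack traps the unhandled case. */
static sketch_rejected_t sketch_reject_mad(sketch_cep_state_t state)
{
    sketch_rejected_t msg = sketch_rejected_msg(state);
    assert(msg != SKETCH_UNHANDLED);
    return msg;
}
```

Without the SKETCH_CEP_REQ_SENT case, a call such as
sketch_reject_mad( SKETCH_CEP_REQ_SENT ) would fall through to the
default and trip the assert, which is the situation in the stack above.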

It would be interesting to find out why you're hitting this: is it
because the remote CM ID is being reused too soon, or because the QP
is being reused too soon?

If the CM ID is being recycled too fast, this is likely a bug in the
CM.  If the QPN is being reused too fast, it's either an HCA or a ULP
issue.

I'll code up a patch that fixes this, and adds a debug print for the
case of finding a duplicate.  When you run it, please let me know
which of the two cases you hit.
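
As a rough idea of what the debug print could distinguish, the sketch
below classifies a duplicate by which key collided.  The struct, its
fields, and the function names are hypothetical, and treating the
remote CA GUID as part of both keys is an assumption; the real patch
may look quite different.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical view of the remote side of a connection. */
typedef struct {
    uint64_t remote_ca_guid;
    uint32_t remote_comm_id;
    uint32_t remote_qpn;
} sketch_conn_t;

typedef enum {
    SKETCH_DUP_NONE,
    SKETCH_DUP_COMM_ID,   /* remote CM ID seen again: suspect the CM */
    SKETCH_DUP_QPN        /* remote QPN seen again: suspect HCA or ULP */
} sketch_dup_t;

/* Decide which key, if any, the incoming connection shares with an
 * existing one from the same remote CA. */
static sketch_dup_t sketch_classify_duplicate(const sketch_conn_t *existing,
                                              const sketch_conn_t *incoming)
{
    assert(existing != NULL && incoming != NULL);

    if (existing->remote_ca_guid != incoming->remote_ca_guid)
        return SKETCH_DUP_NONE;
    if (existing->remote_comm_id == incoming->remote_comm_id)
        return SKETCH_DUP_COMM_ID;
    if (existing->remote_qpn == incoming->remote_qpn)
        return SKETCH_DUP_QPN;
    return SKETCH_DUP_NONE;
}

/* Debug print reporting which of the two cases was hit. */
static void sketch_report_duplicate(sketch_dup_t dup)
{
    switch (dup) {
    case SKETCH_DUP_COMM_ID:
        printf("duplicate: remote CM ID reused\n");
        break;
    case SKETCH_DUP_QPN:
        printf("duplicate: remote QPN reused\n");
        break;
    default:
        printf("no duplicate\n");
        break;
    }
}
```

The CM ID check comes first here, so a connection that collides on both
keys is reported as a CM ID reuse.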

Thanks,

- Fab


