[Openib-windows] A problem in ib_close_al
Leonid Keller
leonid at mellanox.co.il
Wed Jul 12 09:12:18 PDT 2006
> -----Original Message-----
> From: ftillier.sst at gmail.com [mailto:ftillier.sst at gmail.com]
> On Behalf Of Fabian Tillier
> Sent: Tuesday, July 11, 2006 9:18 PM
> To: Leonid Keller
> Cc: openib-windows at openib.org
> Subject: Re: [Openib-windows] A problem in ib_close_al
>
> Hi Leo,
>
> On 7/10/06, Leonid Keller <leonid at mellanox.co.il> wrote:
> > Hi Fab,
> > Our regression runs opensm and then kills it at some point.
> > Sometimes opensm enters a "zombie" state and there is no way to kill it.
> > I investigated this and found that it is stuck in an infinite loop in
> > sync_destroy_obj while destroying opensm's instance of AL.
> > I saw ref_cnt = 13, and the mad_list of the AL object contained 13 records.
>
> The proxy code should free all outstanding MADs before
> calling ib_close_al - see the code path in al_dev_close
> (al_dev.c at 371). The call to __proxy_cleanup_map will free
> all MADs queued internally. Do you know what these MADs are
> from? Are they received MADs that were not delivered to the
> client? Or are they send MADs that are waiting to be sent?
>
I don't know yet how to reproduce this, so:
Is there a way to tell received MADs from sent ones, other than adding a
mad_type field somewhere and setting it after every call to ib_get_mad()?
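
A minimal sketch of the tagging idea, for illustration only. Every name here (mad_dir_t, tracked_mad_t, tag_mad, count_leftover_mads) is hypothetical and does not exist in the AL sources; the real tracking list and MAD element layout are assumed to be richer than this.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical direction tag; not part of the real AL MAD element. */
typedef enum { MAD_DIR_UNKNOWN = 0, MAD_DIR_SEND, MAD_DIR_RECV } mad_dir_t;

/* Stand-in for an entry on the AL object's MAD tracking list. */
typedef struct tracked_mad {
    struct tracked_mad *next;
    mad_dir_t dir;
} tracked_mad_t;

/* Per-direction totals for the MADs still on the list. */
typedef struct { size_t unknown, send, recv; } mad_census_t;

/* Would be called right after the MAD is obtained, once its use is known. */
static void tag_mad(tracked_mad_t *mad, mad_dir_t dir)
{
    assert(mad != NULL);
    mad->dir = dir;
}

/* Walk the list at destroy time and count what is left, by direction. */
static mad_census_t count_leftover_mads(const tracked_mad_t *head)
{
    mad_census_t c = {0, 0, 0};
    for (; head != NULL; head = head->next) {
        switch (head->dir) {
        case MAD_DIR_SEND: c.send++; break;
        case MAD_DIR_RECV: c.recv++; break;
        default:           c.unknown++; break;
        }
    }
    return c;
}
```

Printing such a census from the destroy path of a debug build would show whether the 13 leftover records are sends or receives without needing a reliable reproduction.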
> > I think the problem is that mad_list is freed in free_al(), which
> > can't be called before ref_cnt = 0, and ref_cnt can reach 0 only
> > after free_al() has been called.
>
> Yes, we have a chicken-and-egg issue here. The free_mads
> function is there to do internal cleanup if we proceed with
> destruction when the reference count doesn't hit zero. Note
> that this can only happen in a debug build (where destruction
> will eventually time out and blow the object away).
>
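
The circular dependency can be modelled in a few lines. This is a simplified, single-threaded sketch under stated assumptions: al_model_t, free_al_model and sync_destroy_model are hypothetical stand-ins, each tracked MAD is assumed to hold exactly one reference, and the bounded poll count stands in for the wait inside sync_destroy_obj.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the AL object: each tracked MAD holds one reference. */
typedef struct {
    size_t ref_cnt;
    size_t tracked_mads;
} al_model_t;

/* Cleanup-phase step: frees the tracked MADs and drops their references. */
static void free_al_model(al_model_t *al)
{
    al->ref_cnt -= al->tracked_mads;
    al->tracked_mads = 0;
}

/*
 * Destruction waits for ref_cnt to reach zero before running cleanup.
 * If the only thing that releases the references is the cleanup itself,
 * the wait never ends. Returns true if destruction completed within
 * max_polls iterations.
 */
static bool sync_destroy_model(al_model_t *al, bool free_mads_early,
                               unsigned max_polls)
{
    if (free_mads_early)
        free_al_model(al);      /* variant: free MADs in the destroying step */

    for (unsigned i = 0; i < max_polls; i++) {
        if (al->ref_cnt == 0) {
            free_al_model(al);  /* normal cleanup phase */
            return true;
        }
    }
    return false;               /* would spin forever in a release build */
}
```

With ref_cnt = 13 and 13 tracked MADs, the model never completes unless the MADs are freed before the wait. It does not capture the race where a MAD is queued after the early free.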
> > I've prepared a patch which calls __free_mads() from the AL
> > destroying function.
>
> We could call __free_mads from the destroying_al function,
> but this wouldn't necessarily prevent MADs from being queued
> after that function is called, but before the AL object moves
> to the cleanup phase of destruction (after all references are
> released).
I'm now working on a hang during shutdown after regression, which we
see consistently. The stuck function is ib_deregister_ca(), which waits
in sync_destroy_obj() for the CA to be released; the CA is in turn held
by its PD object.
Today I found that the PD object is held by several MADs that were once
taken from the pool_key of CI_CA. This resembles the problem being
discussed here, and I'd appreciate any idea as to why these MADs would
not be released by __cleanup_pool_key().
For now I have an idea that could probably solve both problems:
What if we forbid inserting new MADs while the AL object is being
destroyed?
Say, by returning an error from ib_get_mad() when
(pool_key->h_al->obj.type == CL_DESTROYING)?
I can prepare a patch if you don't mind.
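
A rough sketch of that proposal, not the actual patch. All types and functions below (al_sketch_t, get_mad_sketch, destroying_al_sketch, the status codes) are hypothetical stand-ins for the real AL structures, and locking is omitted: real code would have to hold the object's lock across both the state check and the list insertion, otherwise the race window stays open.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the object state and return codes. */
typedef enum { OBJ_STATE_ACTIVE, OBJ_STATE_DESTROYING } obj_state_t;
typedef enum { STATUS_OK = 0, STATUS_INVALID_STATE } status_t;

typedef struct mad_elem { struct mad_elem *next; } mad_elem_t;

typedef struct {
    obj_state_t state;
    mad_elem_t *mad_list;   /* MADs tracked by this AL instance */
    size_t      ref_cnt;    /* one reference per tracked MAD */
} al_sketch_t;

/* Refuse to track a new MAD once destruction has started. */
static status_t get_mad_sketch(al_sketch_t *al, mad_elem_t *mad)
{
    assert(al != NULL && mad != NULL);
    if (al->state == OBJ_STATE_DESTROYING)
        return STATUS_INVALID_STATE;

    mad->next = al->mad_list;
    al->mad_list = mad;
    al->ref_cnt++;
    return STATUS_OK;
}

/* Mark the object as destroying, then drain the tracking list. */
static void destroying_al_sketch(al_sketch_t *al)
{
    al->state = OBJ_STATE_DESTROYING;
    while (al->mad_list != NULL) {
        mad_elem_t *mad = al->mad_list;
        al->mad_list = mad->next;
        mad->next = NULL;
        al->ref_cnt--;      /* release the reference the MAD held */
    }
}
```

Because the state flips before the list is drained, nothing can be added behind the drain, so the reference count can reach zero and the wait in the synchronous destroy path can finish.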
>
> A MAD completion that happens just after the destroying_al
> function returns would be able to insert the MAD in the
> tracking list if the AL instance still had some other reference held.
>
> __free_mads must be called from free_al though we could call
> it from destroying_al too, but as I said that doesn't solve
> the problem.
If we can prevent new MADs from being added while the AL object is being
destroyed, it will be OK to call __free_mads from destroying_al.
>
> Anyhow, I think we need a little more information about what
> MADs are left hanging around - send or receive - so that we
> can focus on the right code paths.
>
> - Fab
>
>
>