[Openib-windows] A problem in ib_close_al

Fabian Tillier ftillier at silverstorm.com
Tue Jul 11 11:18:19 PDT 2006


Hi Leo,

On 7/10/06, Leonid Keller <leonid at mellanox.co.il> wrote:
> Hi Fab,
> Our regression runs opensm and then kills it at some point.
> Sometimes opensm enters a "zombie" state and there is no way to kill it.
> I've investigated that and found that it is stuck in an infinite loop in
> sync_destroy_obj while destroying opensm's instance of AL.
> I saw ref_cnt = 13 and mad_list of AL object contained 13 records.

The proxy code should free all outstanding MADs before calling
ib_close_al - see the code path in al_dev_close (al_dev.c at 371).  The
call to __proxy_cleanup_map will free all MADs queued internally.  Do
you know what these MADs are from?  Are they received MADs that were
not delivered to the client?  Or are they send MADs that are waiting
to be sent?

> I think the problem is that mad_list is freed in free_al(), which
> can't be called before ref_cnt = 0, and ref_cnt can reach 0 only
> after calling free_al().

Yes, we have a chicken and egg issue here.  The __free_mads function is
there to do internal cleanup if we proceed with destruction when the
reference count doesn't hit zero.  Note that this can only happen in a
debug build (where destruction will eventually time out and blow the
object away).
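
To make the circular dependency concrete, here is a minimal sketch in C.
It is a toy model, not the AL code: every name in it (toy_al_t,
toy_track_mad, toy_free_mads, toy_destroy) is hypothetical, and it
assumes that each tracked MAD holds one reference on the AL instance,
which is what the matching counts in the report suggest.

```c
#include <stdbool.h>

/* Hypothetical toy model; these names are not from the AL sources. */
typedef struct toy_al {
    int  ref_cnt;    /* owner's reference plus one per tracked MAD (assumed) */
    int  mad_count;  /* stands in for the length of mad_list */
    bool freed;      /* true once the free step has run */
} toy_al_t;

static void toy_al_init(toy_al_t *al)
{
    al->ref_cnt = 1;   /* the owner's reference */
    al->mad_count = 0;
    al->freed = false;
}

/* Tracking a MAD adds it to the list and takes a reference. */
static void toy_track_mad(toy_al_t *al)
{
    al->mad_count++;
    al->ref_cnt++;
}

/* Releases every tracked MAD along with the reference it held. */
static void toy_free_mads(toy_al_t *al)
{
    al->ref_cnt -= al->mad_count;
    al->mad_count = 0;
}

/*
 * One destruction attempt: drop the owner's reference, then run the free
 * step only if the count reached zero. Returns false at the point where
 * a synchronous destroy would keep waiting.
 */
static bool toy_destroy(toy_al_t *al)
{
    al->ref_cnt--;
    if (al->ref_cnt != 0)
        return false;
    toy_free_mads(al);
    al->freed = true;
    return true;
}
```

With 13 MADs tracked, toy_destroy returns false and leaves ref_cnt at
13 with 13 entries still in the list: the free step that would release
them is never reached.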

> I've prepared a patch which calls __free_mads() from the AL destroying
> function.

We could call __free_mads from the destroying_al function, but this
wouldn't necessarily prevent MADs from being queued after that
function is called and before the AL object moves to the cleanup
phase of destruction (after all references are released).

A MAD completion that happens just after the destroying_al function
returns would be able to insert the MAD in the tracking list if the AL
instance still had some other reference held.
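
The window described above can be sketched as a small single-threaded
model in C.  This is an illustration under assumptions, not the AL
implementation: the sketch_* names are hypothetical, each tracked MAD is
assumed to hold one reference, and the completion handler is assumed to
insert whenever the instance is still referenced.

```c
#include <stdbool.h>

/* Hypothetical model of an AL instance; not the real structure. */
typedef struct sketch_al {
    int  ref_cnt;     /* references held on the instance */
    int  mad_count;   /* stands in for the MAD tracking list */
    bool destroying;  /* set once the destroying step has run */
} sketch_al_t;

/* Releases every tracked MAD along with the reference it held. */
static void sketch_free_mads(sketch_al_t *al)
{
    al->ref_cnt -= al->mad_count;
    al->mad_count = 0;
}

/* Destroying step with the proposed early flush of the list. */
static void sketch_destroying(sketch_al_t *al)
{
    al->destroying = true;
    sketch_free_mads(al);
}

/*
 * MAD completion: inserts into the tracking list as long as the instance
 * is still referenced. It deliberately does not look at the destroying
 * flag, which is the behavior being modeled here.
 */
static bool sketch_mad_completion(sketch_al_t *al)
{
    if (al->ref_cnt == 0)
        return false;
    al->mad_count++;
    al->ref_cnt++;
    return true;
}
```

Starting from two references (the owner plus one other holder), a
completion that runs after sketch_destroying puts an entry back into the
list, so a second flush in the final free step is still needed.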

__free_mads must still be called from free_al.  We could call it from
destroying_al too, but as I said, that doesn't solve the problem.

Anyhow, I think we need a little more information about what MADs are
left hanging around - send or receive - so that we can focus on the
right code paths.

- Fab



