[openib-general] [PATCH] Missing check for atomic_dec in ib_post_send_mad

Mon Nov 1 17:50:26 PST 2004

BTW, what I mean is that in the following code, put the atomic_inc()
into the find() routine ...

- KK

ib_mad_recv_done_handler()
{
...
	spin_lock_irqsave(&port_priv->reg_lock, flags);
        /* Determine corresponding MAD agent for incoming receive MAD */
        solicited = solicited_mad(recv->header.recv_buf.mad);
        mad_agent = find_mad_agent(port_priv, recv->header.recv_buf.mad,
                                   solicited);
        if (!mad_agent) {
                spin_unlock_irqrestore(&port_priv->reg_lock, flags);
                printk(KERN_NOTICE PFX "No matching mad agent found for "
                       "received MAD on port %d\n", port_priv->port_num);
        } else {
                atomic_inc(&mad_agent->refcount);
                spin_unlock_irqrestore(&port_priv->reg_lock, flags);
                ib_mad_complete_recv(mad_agent, recv, solicited);
        }
...
}

On Mon, 1 Nov 2004, Krishna Kumar wrote:

> Hi Sean,
>
> I agree on the race between the threads, and this is something that I
> had considered as a separate problem (but now it comes back to haunt
> me :-).
>
> An easier solution for this problem is to make sure that whoever
> gets the agent (ib_mad_recv_done_handler) validate the mad_agent
> before calling us. Basically find_mad_agent can hold a refcnt
> on the agent. Is that correct ? If so, I can make a patch to handle
> races on that front. This code is pretty complicated, so please let
> me know if I have grossly mis-stated something (agents and agent_private,
> and whatnots :-).
>
> Thanks for your feedback,
>
> - KK
>
> On Mon, 1 Nov 2004, Sean Hefty wrote:
>
> > On Mon, 1 Nov 2004 16:38:03 -0800 (PST)
> > Krishna Kumar <krkumar at us.ibm.com> wrote:
> >
> > > Hi Sean,
> > >
> > > I think it is reasonable to have current senders racing with
> > > unregister. The unregister is waiting for all references to drop to
> > > zero before freeing up the resources. It killed the ones waiting for
> > > responses(mad_cancel), killed the ones who are executing in callback
> > > handlers, and finally after dropping the loader's module refcnt, it
> > > waits for the refcnt to drop to zero. These can only be threads which
> > > are actively receiving mad packets and those threads in the process of
> > > sending mad packets while the unregister was going on (and the ones
> > > which fail is the only cause of the problem). Essentially I think the
> > > unregister will hang and not free up the resource.
> >
> > The difference here is that a client is calling into the API at the same
> > time that they are trying to unregister.  The code, even with this
> > change, cannot handle this condition.
> >
> > For example, if the thread calling ib_unregister_mad_agent executes
> > completely before the thread calling ib_post_send_mad runs (or can take
> > a reference on the mad_agent), the mad_agent is no longer valid, and the
> > structure will have been freed.  The thread executing ib_post_send_mad
> > can crash the system at this point.
> >
> > If we want to allow a client to call ib_unregister_mad_agent and
> > ib_post_send_mad simultaneously, then ib_post_send_mad would need to
> > perform some sort of lookup (likely in some global map) to validate the
> > mad_agent.
> >
> > - Sean
> >
> >
>
>