[ofa-general] Re: 2.6.30.1: possible irq lock inversion dependency detected

Bart Van Assche bart.vanassche at gmail.com
Tue Aug 11 13:29:42 PDT 2009


On Mon, Aug 10, 2009 at 10:48 PM, Roland Dreier <rdreier at cisco.com> wrote:
>
>  > The lockdep report I obtained this morning with a 2.6.30.4 kernel and
>  > the two patches applied has been attached to the kernel bugzilla
>  > entry. This lockdep report was generated while testing the SRPT target
>  > software. I have double checked that the SRPT target implementation
>  > does not hold any spinlocks or mutexes while calling functions in the
>  > IB core. This means that the SRPT target code cannot have caused any
>  > of the reported lock cycles.
>
> Lockdep is not quite as simple as what you checked, but yes, in this
> case it does appear to be pointing at a real (albeit spectacularly
> unlikely) deadlock in the core IB stack:
>
>  ib_cm takes cm_id_priv->lock and calls ib_post_send_mad()
>  from there, ib_mad takes mad_agent_priv->lock
>
>  in another context, ib_mad takes mad_agent_priv->lock and does
>  cancel_delayed_work(&mad_agent_priv->timed_work) (and internally
>  cancel_delayed_work() does del_timer_sync())
>
>  finally, in another context a communication established event can
>  occur and generate a callback (in interrupt context) to ib_cm where it
>  takes cm_id_priv->lock
>
> So there can be a chain that deadlocks: if the timer for the timed_work
> is running on a CPU, and the interrupt for the communication established
> event occurs while the timer is running, then that interrupt handler can
> try to take cm_id_priv->lock.
>
> However, on another CPU, someone could already be holding
> cm_id_priv->lock, have called into ib_post_send_mad(), and be spinning
> on mad_agent_priv->lock, while on yet another CPU, someone could be
> holding mad_agent_priv->lock and doing cancel_delayed_work().
>
> And that cancel_delayed_work() will deadlock, waiting in
> del_timer_sync(), since the timer has been interrupted by an interrupt
> handler that will spin on a spinlock that is part of this chain.
>
> I'm not sure what the right fix is.  It does seem to me that this should
> be fixed within the ib_mad module, since doing del_timer_sync() within a
> spinlocked region seems like the fundamental problem.  However I'm not
> sure what the best way to rewrite the ib_mad usage is.
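
To make the chain described above a bit more concrete for other readers,
here is a rough sketch of the pattern with made-up names: lock_a plays
the role of cm_id_priv->lock and lock_b the role of mad_agent_priv->lock.
This is not the actual ib_cm/ib_mad code, only the shape of the
inversion:

#include <linux/spinlock.h>
#include <linux/workqueue.h>
#include <linux/interrupt.h>

static DEFINE_SPINLOCK(lock_a);		/* role of cm_id_priv->lock */
static DEFINE_SPINLOCK(lock_b);		/* role of mad_agent_priv->lock */

static void timed_work_fn(struct work_struct *work)
{
	/* cf. the ib_mad timeout handling; its contents do not matter here */
}
static DECLARE_DELAYED_WORK(timed_work, timed_work_fn);

/* CPU 0: cf. ib_cm holding its lock while calling ib_post_send_mad(). */
static void cpu0_path(void)
{
	unsigned long flags;

	spin_lock_irqsave(&lock_a, flags);
	spin_lock(&lock_b);		/* spins: CPU 1 holds lock_b */
	spin_unlock(&lock_b);
	spin_unlock_irqrestore(&lock_a, flags);
}

/* CPU 1: cf. ib_mad cancelling its delayed work while holding its lock;
 * on 2.6.30 cancel_delayed_work() ends up in del_timer_sync(), which
 * waits for the timer function that is running on CPU 2 to return. */
static void cpu1_path(void)
{
	unsigned long flags;

	spin_lock_irqsave(&lock_b, flags);
	cancel_delayed_work(&timed_work);
	spin_unlock_irqrestore(&lock_b, flags);
}

/* CPU 2: the delayed work's timer is running when a "communication
 * established" interrupt arrives; the interrupt handler needs lock_a,
 * which CPU 0 holds, so the timer function never finishes and CPU 1
 * never releases lock_b. */
static irqreturn_t comm_est_irq(int irq, void *dev_id)
{
	spin_lock(&lock_a);		/* spins: CPU 0 holds lock_a */
	spin_unlock(&lock_a);
	return IRQ_HANDLED;
}

With these three contexts running at the same time, each CPU ends up
waiting for one of the others, so none of them can make progress.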

It's already good news that the potential lock cycle has been deduced
from the lockdep reports. I know that it can take a lot of work to
analyze such reports.

Even though it is very unlikely that this lock cycle would ever cause a
deadlock, it would be great if it could be removed. I'm not the only
developer of kernel modules who runs tests with lockdep enabled, and it
is impractical to analyze long log files full of known lock cycles in
order to find the single lock cycle caused by newly added or recently
modified code.

>  > By the way, I noticed that while many subsystems in the Linux kernel
>  > use event queues to report information to higher software layers, the
>  > IB core makes extensive use of callback functions. The combination
>  > of nested locking and callback functions can easily lead to lock
>  > inversion. This effect is well known in the operating system world --
>  > see e.g. the talk by John Ousterhout about multithreaded versus
>  > event-driven software (http://home.pacbell.net/ouster/threads.pdf,
>  > 1996).
>
> I'm not sure what you mean by this.  What would be an example of a
> subsystem that uses event queues to report information?  I think the
> design of the RDMA stack is quite parallel to most other Linux
> subsystems, and we don't have anything as deadlock prone as, say, the
> network stack's rtnl.

What I had in mind as an example is the netlink socket mechanism,
although this is a mechanism for sending notifications from the kernel
to userspace.

> Trying to queue events up instead of calling back from interrupt context
> is not all that simple, since one cannot reliably allocate memory, and
> one must deal with synchronization with the consuming context, etc.  It's
> probably at least as deadlock-prone to try and queue as it is to just
> call back.

One possible approach when events have to be queued from interrupt
context is to put them in a fixed-size queue that has been allocated
outside interrupt context, and to make it possible for the event
consumer to detect the queue overflow condition. When a queue overflow
happens, it is the responsibility of the event consumer to query the
state of the event producer. This approach is more complex than callback
functions, but it has the advantage that there can never be a lock cycle
involving locks of both the event producer and the event consumer.
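
To be more concrete, below is a rough sketch of what I mean. All names
(struct event_ring, event_post(), event_get()) are made up for the sake
of illustration and are not taken from any existing subsystem:

#include <linux/spinlock.h>
#include <linux/string.h>
#include <linux/types.h>

/* Fixed size, allocated up front; a power of two so the indices wrap
 * correctly. */
#define EVENT_RING_SIZE	64

struct event_ring {
	spinlock_t	lock;		/* protects only this ring */
	unsigned int	head, tail;
	bool		overflowed;
	u32		events[EVENT_RING_SIZE];
};

static void event_ring_init(struct event_ring *r)
{
	memset(r, 0, sizeof(*r));
	spin_lock_init(&r->lock);
}

/* Producer side, safe to call from interrupt context: no memory is
 * allocated and no consumer code is called back. */
static void event_post(struct event_ring *r, u32 ev)
{
	unsigned long flags;

	spin_lock_irqsave(&r->lock, flags);
	if (r->head - r->tail < EVENT_RING_SIZE)
		r->events[r->head++ % EVENT_RING_SIZE] = ev;
	else
		r->overflowed = true;	/* consumer must resynchronize */
	spin_unlock_irqrestore(&r->lock, flags);
}

/* Consumer side, called from process context.  Returns true and stores
 * one event in *ev, or returns false if the ring is empty.  When
 * *overflowed is set the consumer must query the producer's state
 * instead of trusting the (incomplete) event stream. */
static bool event_get(struct event_ring *r, u32 *ev, bool *overflowed)
{
	unsigned long flags;
	bool ret = false;

	spin_lock_irqsave(&r->lock, flags);
	*overflowed = r->overflowed;
	r->overflowed = false;
	if (r->tail != r->head) {
		*ev = r->events[r->tail++ % EVENT_RING_SIZE];
		ret = true;
	}
	spin_unlock_irqrestore(&r->lock, flags);
	return ret;
}

The ring's lock is only held around these two short functions, and
neither layer ever holds one of its own locks while calling into the
other, so no lock cycle involving both layers can form; the price is
that the consumer has to resynchronize with the producer whenever it
sees the overflow flag.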

I'm not inventing anything new here -- this is exactly how netlink sockets work.

Bart.


