[ofa-general] Re: 2.6.30.1: possible irq lock inversion dependency detected

Mon Aug 10 13:48:28 PDT 2009

 > The lockdep report I obtained this morning with a 2.6.30.4 kernel and
 > the two patches applied has been attached to the kernel bugzilla
 > entry. This lockdep report was generated while testing the SRPT target
 > software. I have double checked that the SRPT target implementation
 > does not hold any spinlocks or mutexes while calling functions in the
 > IB core. This means that the SRPT target code cannot have caused any
 > of the reported lock cycles.

Lockdep is not quite so simple as what you checked, but yes, in this
case it does appear to be pointing at a real (albeit spectacularly
unlikely) deadlock in the core IB stack:

  ib_cm takes cm_id_priv->lock and calls ib_post_send_mad()
  from there, ib_mad takes mad_agent_priv->lock

  in another context, ib_mad takes mad_agent_priv->lock and does
  cancel_delayed_work(&mad_agent_priv->timed_work) (and internally
  cancel_delayed_work() does del_timer_sync())

  finally, in another context a communication established event can
  occur and generate a callback (in interrupt context) to ib_cm where it
  takes cm_id_priv->lock

So there can be a chain that deadlocks: if the timer for the timed_work
is running on a CPU, and the interrupt for the communication established
event occurs while the timer is running, then that interrupt handler can
try to take cm_id_priv->lock.

However, on another CPU, someone could already be holding
cm_id_priv->lock, have called into ib_post_send_mad(), and be spinning
on mad_agent_priv->lock, while on yet another CPU, someone could be
holding mad_agent_priv->lock and doing cancel_delayed_work().

And that will deadlock waiting in del_timer_sync() since the timer has
been interrupted by an interrupt handler that will spin on a spinlock
that is part of this chain.
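
To make the chain above concrete, here is a small userspace C sketch
that models the three dependencies as a directed graph and checks it
for a cycle.  It is not kernel code and not how lockdep itself is
implemented; the node names, struct dep_graph and the helper functions
are made up for illustration, and treating the timer as a graph node is
a simplification of the wait inside del_timer_sync().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Nodes of the simplified dependency graph (names are illustrative). */
enum node {
    CM_ID_LOCK,       /* stands in for cm_id_priv->lock */
    MAD_AGENT_LOCK,   /* stands in for mad_agent_priv->lock */
    TIMED_WORK_TIMER, /* stands in for the timer behind timed_work */
    NR_NODES
};

/* waits_on[a][b]: a context holding (or running as) a may wait for b. */
struct dep_graph {
    bool waits_on[NR_NODES][NR_NODES];
};

static void add_dep(struct dep_graph *g, enum node from, enum node to)
{
    assert(from < NR_NODES && to < NR_NODES);
    g->waits_on[from][to] = true;
}

/* Depth-first search. state: 0 = unvisited, 1 = on current path, 2 = done. */
static bool visit(const struct dep_graph *g, int n, int state[NR_NODES])
{
    state[n] = 1;
    for (int m = 0; m < NR_NODES; m++) {
        if (!g->waits_on[n][m])
            continue;
        if (state[m] == 1)
            return true;    /* back edge: the dependencies form a cycle */
        if (state[m] == 0 && visit(g, m, state))
            return true;
    }
    state[n] = 2;
    return false;
}

static bool has_cycle(const struct dep_graph *g)
{
    int state[NR_NODES] = {0};

    for (int n = 0; n < NR_NODES; n++)
        if (state[n] == 0 && visit(g, n, state))
            return true;
    return false;
}

/* The three dependencies described in the mail. */
static struct dep_graph build_reported_graph(void)
{
    struct dep_graph g = {0};

    /* ib_cm holds cm_id_priv->lock and calls ib_post_send_mad() */
    add_dep(&g, CM_ID_LOCK, MAD_AGENT_LOCK);
    /* ib_mad holds mad_agent_priv->lock and waits in del_timer_sync() */
    add_dep(&g, MAD_AGENT_LOCK, TIMED_WORK_TIMER);
    /* the timer cannot finish while the interrupt handler that
     * preempted it spins on cm_id_priv->lock */
    add_dep(&g, TIMED_WORK_TIMER, CM_ID_LOCK);
    return g;
}
```

In this model, has_cycle() returns true for the graph from
build_reported_graph(), and removing any one of the three edges makes
it return false, which is why breaking the del_timer_sync()-under-lock
edge would be enough to break the chain.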

I'm not sure what the right fix is.  It does seem to me that this should
be fixed within the ib_mad module, since doing del_timer_sync() within a
spinlocked region seems like the fundamental problem.  However I'm not
sure what the best way to rewrite the ib_mad usage is.

 > By the way, I noticed that while many subsystems in the Linux kernel
 > use event queues to report information to higher software layers, that
 > the IB core makes extensive use of callback functions. The combination
 > of nested locking and callback functions can easily lead to lock
 > inversion. This effect is well known in the operating system world --
 > see e.g. the talk by John Ousterhout about multithreaded versus
 > event-driven software (http://home.pacbell.net/ouster/threads.pdf,
 > 1996).

I'm not sure what you mean by this.  What would be an example of a
subsystem that uses event queues to report information?  I think the
design of the RDMA stack is quite parallel to most other Linux
subsystems, and we don't have anything as deadlock prone as, say, the
network stack's rtnl.

Trying to queue events up instead of calling back from interrupt context
is not all that simple, since one cannot reliably allocate memory, and
one must deal with synchronization with the consuming context etc.  It's
probably at least as deadlock-prone to try and queue as it is to just
call back.
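
As a rough illustration of those constraints, here is a userspace C11
sketch of a preallocated single-producer/single-consumer event ring.
It is only a model under stated assumptions: struct event, struct
event_queue, EVQ_SIZE and the evq_*() helpers are hypothetical names,
not anything from the IB core, and real kernel code would use kernel
primitives rather than <stdatomic.h>.  The producer never allocates and
never blocks, so a full queue means an event has to be dropped and
accounted for somehow.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define EVQ_SIZE 4u   /* hypothetical capacity; must be a power of two */

struct event {
    int type;         /* hypothetical event code */
    void *context;    /* hypothetical per-event cookie */
};

struct event_queue {
    struct event slots[EVQ_SIZE]; /* preallocated, so pushing never allocates */
    atomic_uint head;     /* next slot to write; advanced only by producer */
    atomic_uint tail;     /* next slot to read; advanced only by consumer */
    atomic_uint dropped;  /* events lost because the ring was full */
};

static void evq_init(struct event_queue *q)
{
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
    atomic_init(&q->dropped, 0);
}

/* Producer side (think: interrupt handler). Never blocks or allocates. */
static bool evq_push(struct event_queue *q, struct event ev)
{
    unsigned head = atomic_load_explicit(&q->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&q->tail, memory_order_acquire);

    if (head - tail == EVQ_SIZE) {
        /* Ring is full: the event is lost, all we can do is count it. */
        atomic_fetch_add_explicit(&q->dropped, 1, memory_order_relaxed);
        return false;
    }
    q->slots[head % EVQ_SIZE] = ev;
    /* Publish the slot contents before the consumer can see the new head. */
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side (think: process context). Returns false if empty. */
static bool evq_pop(struct event_queue *q, struct event *out)
{
    unsigned tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&q->head, memory_order_acquire);

    if (head == tail)
        return false;
    *out = q->slots[tail % EVQ_SIZE];
    /* Hand the slot back to the producer only after copying it out. */
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}
```

Even this small model needs an overflow policy and careful memory
ordering between the two sides, and it only handles one producer and
one consumer; supporting several producers would bring locking back in.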

Ousterhout's talk certainly makes sense for a certain class of userspace
apps, but he explicitly says that event driven programming only uses one
CPU, and of course userspace doesn't have hard interrupt handlers or
anything like that.  So the kernel is more complex just because the
environment it runs under is a little trickier than what the kernel
provides for userspace.

 - R.