[ofa-general] Re: 2.6.30.1: possible irq lock inversion dependency detected

Bart Van Assche bart.vanassche at gmail.com
Thu Jul 30 04:00:49 PDT 2009


On Fri, Jul 10, 2009 at 10:42 PM, Roland Dreier <rdreier at cisco.com> wrote:
>
>  > Thanks for the patch. With the patch applied the lockdep warning
>  > indeed occurs sooner and the output is now indeed shorter. You can
>  > find the new lockdep output here:
>  > http://bugzilla.kernel.org/attachment.cgi?id=22305.
>
> Thanks, that actually looks like a completely different issue (that I
> can actually understand).  I was able to reproduce that here: the issue
> is doing skb_orphan() inside of priv->lock, and the network stack
> locking is not irq-safe.  So the following hacky patch fixes that.
>
> This would be a short-term solution for the immediate issue at least.  A
> better solution would be if we didn't need to make priv->lock
> hardirq-safe: the only place that requires it is the QP event handler in
> ipoib_cm.c, and that might be a little dicey to fix.  Need to think about that.
>
> However with this patch applied I don't see any further lockdep reports
> here.  It would be great if you could retest yet again with this applied
> (on top of my earlier patch to make priv->lock hardirq-safe as early as
> possible).

Hello Roland,

Sorry, but I'm afraid that the two kernel patches posted in this thread
are not sufficient to fix all outstanding locking issues in the 2.6.30
IB subsystem. I encountered the following kernel messages today:

OpenSM[8074]: SM port is down
OpenSM[8074]: SM port is down
OpenSM[8074]: SM port is down
OpenSM[8074]: Entering MASTER state
ib_srpt: ASYNC event= 17 on device= mlx4_0
ib_srpt: ASYNC event= 11 on device= mlx4_0
ib_srpt: ASYNC event= 9 on device= mlx4_0
OpenSM[8074]: SUBNET UP
ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready

======================================================
[ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
2.6.30.3-scst-debug #1
------------------------------------------------------
firefox/4069 [HC0[0]:SC1[2]:HE0:SE0] is trying to acquire:
 (&mad_agent_priv->lock){..-...}, at: [<ffffffffa04395e2>] ib_post_send_mad+0xe2/0x7d0 [ib_mad]

and this task is already holding:
 (&priv->lock){-.-...}, at: [<ffffffffa047bb4d>] ipoib_path_lookup+0x4d/0x2f0 [ib_ipoib]
which would create a new lock dependency:
 (&priv->lock){-.-...} -> (&mad_agent_priv->lock){..-...}

but this new dependency connects a HARDIRQ-irq-safe lock:
 (&priv->lock){-.-...}
... which became HARDIRQ-irq-safe at:
  [<ffffffffffffffff>] 0xffffffffffffffff

to a HARDIRQ-irq-unsafe lock:
 (&(&mad_agent_priv->timed_work)->timer){+.-...}
... which became HARDIRQ-irq-unsafe at:
...  [<ffffffffffffffff>] 0xffffffffffffffff

[ ... ]

stack backtrace:
Pid: 4069, comm: firefox Not tainted 2.6.30.3-scst-debug #1
Call Trace:
 <IRQ>  [<ffffffff8027352a>] check_usage+0x3ba/0x470
 [<ffffffff80273644>] check_irq_usage+0x64/0x100
 [<ffffffff802746d9>] __lock_acquire+0xff9/0x1c80
 [<ffffffff80275468>] lock_acquire+0x108/0x150
 [<ffffffffa04395e2>] ? ib_post_send_mad+0xe2/0x7d0 [ib_mad]
 [<ffffffff80515061>] _spin_lock_irqsave+0x41/0x60
 [<ffffffffa04395e2>] ? ib_post_send_mad+0xe2/0x7d0 [ib_mad]
 [<ffffffffa04395e2>] ib_post_send_mad+0xe2/0x7d0 [ib_mad]
 [<ffffffff8037c39c>] ? idr_get_new_above_int+0x1c/0x90
 [<ffffffffa04659d4>] send_mad+0xb4/0x110 [ib_sa]
 [<ffffffffa04223ef>] ? ib_pack+0x17f/0x210 [ib_core]
 [<ffffffffa046613d>] ib_sa_path_rec_get+0x1ed/0x260 [ib_sa]
 [<ffffffffa047afa9>] path_rec_start+0x89/0xf0 [ib_ipoib]
 [<ffffffffa047bdf0>] ? path_rec_completion+0x0/0x540 [ib_ipoib]
 [<ffffffffa047bdc9>] ipoib_path_lookup+0x2c9/0x2f0 [ib_ipoib]
 [<ffffffffa047c5bd>] ipoib_start_xmit+0x17d/0x440 [ib_ipoib]
 [<ffffffff80488bfd>] dev_hard_start_xmit+0x2bd/0x340
 [<ffffffff80488997>] ? dev_hard_start_xmit+0x57/0x340
 [<ffffffff8049d4be>] __qdisc_run+0x25e/0x2b0
 [<ffffffff804890a0>] dev_queue_xmit+0x2f0/0x4c0
 [<ffffffff80488e02>] ? dev_queue_xmit+0x52/0x4c0
 [<ffffffff8048f489>] neigh_connected_output+0xa9/0xe0
 [<ffffffff804911f5>] neigh_update+0x265/0x510
 [<ffffffff804909f9>] ? neigh_lookup+0x129/0x160
 [<ffffffff804d4332>] arp_process+0x392/0x8c0
 [<ffffffff804d3fa0>] ? arp_process+0x0/0x8c0
 [<ffffffff802726bd>] ? trace_hardirqs_on_caller+0x6d/0x1a0
 [<ffffffff804d4989>] arp_rcv+0x119/0x130
 [<ffffffff80487892>] netif_receive_skb+0x392/0x4e0
 [<ffffffff80487610>] ? netif_receive_skb+0x110/0x4e0
 [<ffffffffa047dfd6>] ipoib_ib_handle_rx_wc+0x166/0x2a0 [ib_ipoib]
 [<ffffffffa047f771>] ipoib_poll+0x181/0x1e0 [ib_ipoib]
 [<ffffffff80485fda>] net_rx_action+0x17a/0x260
 [<ffffffff80485f53>] ? net_rx_action+0xf3/0x260
 [<ffffffff8024ef49>] ? __do_softirq+0x59/0x230
 [<ffffffff8024efdf>] __do_softirq+0xef/0x230
 [<ffffffff8020d0fc>] call_softirq+0x1c/0x30
 [<ffffffff8020ee95>] do_softirq+0x75/0xb0
 [<ffffffff8024eaa5>] irq_exit+0x95/0xa0
 [<ffffffff8020e61d>] do_IRQ+0x8d/0xf0
 [<ffffffff8020c913>] ret_from_intr+0x0/0xf
 <EOI>  [<ffffffff80514cf1>] ? _spin_unlock_irq+0x31/0x60
 [<ffffffff8023fcc9>] ? finish_task_switch+0x89/0x110
 [<ffffffff8023fc86>] ? finish_task_switch+0x46/0x110
 [<ffffffffa0438d00>] ? ib_mad_completion_handler+0x0/0x800 [ib_mad]
 [<ffffffff80511737>] ? thread_return+0x52/0x85b
 [<ffffffff8051472e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff8027279d>] ? trace_hardirqs_on_caller+0x14d/0x1a0
 [<ffffffff8051472e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff80511f53>] ? schedule+0x13/0x40
 [<ffffffff8020ca04>] ? retint_careful+0x12/0x2e
ib0: no IPv6 routers present
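
To make the rule behind the report above easier to follow, here is a toy
Python model of the check. It is only a sketch and not how lockdep is
implemented: the class LockModel and its method names are made up for
illustration, and the dependency from mad_agent_priv->lock to the
timed_work timer is an assumption (the part of the report that would
show it is elided above). The idea is that a lock ever taken in hardirq
context (HARDIRQ-safe) must never be held while acquiring a lock that
is, directly or through other recorded dependencies, taken with hardirqs
enabled (HARDIRQ-unsafe).

```python
from collections import defaultdict


class LockModel:
    """Toy model of the HARDIRQ-safe -> HARDIRQ-unsafe rule (not real lockdep)."""

    def __init__(self):
        self.hardirq_safe = set()     # lock classes ever taken in hardirq context
        self.hardirq_unsafe = set()   # lock classes ever taken with hardirqs enabled
        self.deps = defaultdict(set)  # held lock -> locks acquired while holding it

    def mark_used_in_hardirq(self, lock):
        self.hardirq_safe.add(lock)

    def mark_taken_with_irqs_enabled(self, lock):
        self.hardirq_unsafe.add(lock)

    def _reachable(self, start):
        """Return every lock reachable from start through recorded dependencies."""
        seen, stack = set(), [start]
        while stack:
            lock = stack.pop()
            if lock in seen:
                continue
            seen.add(lock)
            stack.extend(self.deps[lock])
        return seen

    def add_dependency(self, held, acquired):
        """Record held -> acquired and return the HARDIRQ-unsafe locks that a
        HARDIRQ-safe held lock can now reach (empty list means no complaint).

        Simplification: only the lock being held is checked, not locks that
        were recorded earlier as leading to it.
        """
        self.deps[held].add(acquired)
        if held not in self.hardirq_safe:
            return []
        return sorted(l for l in self._reachable(acquired)
                      if l in self.hardirq_unsafe)


# Lock names are taken from the report; the middle dependency is assumed.
model = LockModel()
model.mark_used_in_hardirq("priv->lock")
model.mark_taken_with_irqs_enabled("mad_agent_priv->timed_work.timer")
model.add_dependency("mad_agent_priv->lock", "mad_agent_priv->timed_work.timer")
violations = model.add_dependency("priv->lock", "mad_agent_priv->lock")
print(violations)
```

In this model the last call reports the timer lock, not the lock being
acquired. That matches the shape of the report, where the lock being
acquired is mad_agent_priv->lock but the HARDIRQ-unsafe lock named is the
timed_work timer. The danger being flagged is that a hardirq could arrive
on a CPU that holds the unsafe lock with interrupts enabled and then spin
on priv->lock while another CPU holds priv->lock and waits for the unsafe
lock.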

Bart.


