[ofa-general] soft lockup in ipoib

Eli Cohen eli at dev.mellanox.co.il
Sun Jun 29 05:58:23 PDT 2008


On Sun, Jun 29, 2008 at 03:40:57PM +0300, Or Gerlitz wrote:
> Eli, in the same setting of having troubles with the new connectx
> firmware, I see also this soft lockup at ipoib which takes place when
> there's an attempt to probe out the mlx4 driver.
Under what conditions does this happen? It's not just unloading the
module, is it?

> I recall that some fix
> was done lately to wait_for_completion in the mainline kernel, is this
> hang related to that?
I don't know. I would expect that to happen if the CPU is really
overloaded with work. From what I see below, the process is in 'D' state,
which I think is uninterruptible sleep, and that suggests that it does not
get a completion on a command. However, ofed can handle that since it
checks done->done before deciding if this is a failure.
What are you running?
Did you wait long enough to see if the module eventually unloads?
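
To illustrate the done->done check mentioned above, here is a minimal
userspace sketch in Python. It is not the ofed code: the Completion class,
wait_timeout() and command_succeeded() are hypothetical names that only
mimic the idea of looking at the done counter after a timed wait instead of
trusting the timeout result alone.

```python
import threading


class Completion:
    """Userspace stand-in for a kernel completion (hypothetical, for illustration)."""

    def __init__(self):
        self.done = 0
        self._cond = threading.Condition()

    def complete(self):
        # Signal that the command finished and wake one waiter.
        with self._cond:
            self.done += 1
            self._cond.notify()

    def wait_timeout(self, timeout):
        # Return True if signalled within `timeout` seconds, False otherwise.
        with self._cond:
            return self._cond.wait_for(lambda: self.done > 0, timeout)


def command_succeeded(comp, wait_result):
    # Even when the wait reported a timeout, the completion may have been
    # signalled in the meantime, so consult the done counter before
    # declaring the command a failure.
    return bool(wait_result) or comp.done > 0
```

With this shape, a waiter that was slow to run (for example on a heavily
loaded CPU) does not report a failure if the completion has in fact arrived.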


> 
> ipoib         D ffffffff88445ea2     0  2820    105          2954  2809 (L-TLB)
>  ffff8102236cbd90 0000000000000046 ffff8102238b4860 ffff8102238b4980
>  000000000000000a ffff81022c0dd100 ffff810107b94100 00000096b408c7fc
>  00000000000036b9 ffff81022c0dd2e8 ffff810200000001 ffff8102238b4800
> Call Trace:
>  [<ffffffff88445ea2>] :ib_ipoib:ipoib_cm_tx_reap+0x0/0x21b
>  [<ffffffff800610e7>] wait_for_completion+0x79/0xa2
>  [<ffffffff800884ac>] default_wake_function+0x0/0xe
>  [<ffffffff8842d18f>] :ib_cm:cm_enter_timewait+0xa5/0xb5
>  [<ffffffff8842de30>] :ib_cm:cm_destroy_id+0x1cf/0x209
>  [<ffffffff88445f20>] :ib_ipoib:ipoib_cm_tx_reap+0x7e/0x21b
>  [<ffffffff8004b2ab>] run_workqueue+0x94/0xe5
>  [<ffffffff80047c09>] worker_thread+0x0/0x122
>  [<ffffffff8009b283>] keventd_create_kthread+0x0/0x61
>  [<ffffffff80047cf9>] worker_thread+0xf0/0x122
>  [<ffffffff800884ac>] default_wake_function+0x0/0xe
>  [<ffffffff8009b283>] keventd_create_kthread+0x0/0x61
>  [<ffffffff8009b283>] keventd_create_kthread+0x0/0x61
>  [<ffffffff800321d8>] kthread+0xfe/0x132
>  [<ffffffff8005bfb1>] child_rip+0xa/0x11
>  [<ffffffff8009b283>] keventd_create_kthread+0x0/0x61
>  [<ffffffff800320da>] kthread+0x0/0x132
>  [<ffffffff8005bfa7>] child_rip+0x0/0x11
> 
> rmmod         D 0000000000000880     0  5875   5874                     (NOTLB)
>  ffff810218085d88 0000000000000082 ffff81022d6961c0 ffff81022d1efc00
>  0000000000000007 ffff81022e9457e0 ffff81022fc16100 0000023f4abf6f5f
>  000000000000db90 ffff81022e9459c8 ffff810200000003 0000000000000246
> Call Trace:
>  [<ffffffff80098556>] flush_cpu_workqueue+0x7f/0xad
>  [<ffffffff8009b446>] autoremove_wake_function+0x0/0x2e
>  [<ffffffff80098592>] flush_workqueue+0xe/0x85
>  [<ffffffff88443081>] :ib_ipoib:ipoib_mcast_stop_thread+0x93/0x9a
>  [<ffffffff88440486>] :ib_ipoib:ipoib_ib_dev_cleanup+0x2f/0x40
>  [<ffffffff8843e5d6>] :ib_ipoib:ipoib_dev_cleanup+0x78/0xaf
>  [<ffffffff8843e66c>] :ib_ipoib:ipoib_remove_one+0x5f/0x96
>  [<ffffffff8838afa4>] :ib_core:ib_unregister_device+0x30/0xc4
>  [<ffffffff883de87d>] :mlx4_ib:mlx4_ib_remove+0x47/0x7c
>  [<ffffffff88161280>] :mlx4_core:mlx4_remove_device+0x44/0x6a
>  [<ffffffff881613a7>] :mlx4_core:mlx4_unregister_interface+0x29/0x68
>  [<ffffffff883e42e0>] :mlx4_ib:mlx4_ib_cleanup+0x10/0x34
>  [<ffffffff800a0d35>] sys_delete_module+0x196/0x1c5
>  [<ffffffff8005b28d>] tracesys+0xd5/0xe0
> 


