[ofa-general] WinOF_2_0_5/SRP initiator: slow reads and eventually hangs

Chris Worley worleys at gmail.com
Thu Sep 3 08:35:49 PDT 2009


On Wed, Sep 2, 2009 at 11:59 PM, Bart Van
Assche<bart.vanassche at gmail.com> wrote:
> On Thu, Sep 3, 2009 at 6:08 AM, Chris Worley<worleys at gmail.com> wrote:
>> In trying to build a kernel w/ lockdep support as Vlad requested, my
>> lack of Debian knowledge shone through, and, although I believe I
>> followed all the instructions correctly, I'm not sure if I have a
>> 2.6.28-15 or 2.6.28-10 kernel.  Anyway, the issue is still repeatable.
>>
>> Whatever kernel that is, I have SRP hung currently.  What should I
>> look for in /proc/lockd*?
>
> Lockdep sends its error messages to the kernel ring buffer (dmesg /
> /var/log/messages). An example of an error message generated by
> lockdep can be found here: http://lkml.org/lkml/2007/5/22/443.

The fio hang shows no lock issues (all 64 fio processes are in this state):

[ 3715.150077] INFO: task fio:27967 blocked for more than 120 seconds.
[ 3715.150131] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 3715.150183] fio           D ffff8803c183baa0     0 27967      1
[ 3715.150187]  ffff8803c183ba38 0000000000000046 ffff8803c450a920
0000000000000001
[ 3715.150191]  ffff8803c183b9a8 ffff8803c450a1f0 ffff8803c18ec3e0
ffff8803c450a560
[ 3715.150194]  00000001c450a8e8 0000000000000001 ffff8803c450a560
ffffffff80219c49
[ 3715.150197] Call Trace:
[ 3715.150205]  [<ffffffff80219c49>] ? sched_clock+0x9/0x10
[ 3715.150210]  [<ffffffff804105d2>] ? get_request_wait+0xb2/0x1b0
[ 3715.150214]  [<ffffffff806a98fb>] ? _spin_unlock_irq+0x2b/0x40
[ 3715.150218]  [<ffffffff806a6d87>] io_schedule+0x37/0x50
[ 3715.150220]  [<ffffffff804105d7>] get_request_wait+0xb7/0x1b0
[ 3715.150224]  [<ffffffff80267c60>] ? autoremove_wake_function+0x0/0x40
[ 3715.150226]  [<ffffffff8040bee2>] ? elv_merge+0x32/0x1e0
[ 3715.150228]  [<ffffffff8041075e>] __make_request+0x8e/0x4c0
[ 3715.150231]  [<ffffffff8027a9fc>] ? __lock_acquire+0x33c/0x1250
[ 3715.150234]  [<ffffffff8040eebe>] generic_make_request+0x37e/0x4c0
[ 3715.150236]  [<ffffffff80219be0>] ? native_sched_clock+0x20/0x80
[ 3715.150238]  [<ffffffff80219c49>] ? sched_clock+0x9/0x10
[ 3715.150240]  [<ffffffff8040f075>] submit_bio+0x75/0xf0
[ 3715.150243]  [<ffffffff803186fc>] ? bio_set_pages_dirty+0x4c/0x70
[ 3715.150246]  [<ffffffff8031ba1e>] dio_bio_submit+0x5e/0x90
[ 3715.150248]  [<ffffffff8031c702>] __blockdev_direct_IO+0x582/0xce0
[ 3715.150250]  [<ffffffff8031b579>] blkdev_direct_IO+0x49/0x50
[ 3715.150252]  [<ffffffff80319ca0>] ? blkdev_get_blocks+0x0/0xc0
[ 3715.150256]  [<ffffffff802b9e31>] generic_file_aio_read+0x6a1/0x6c0
[ 3715.150258]  [<ffffffff8027a9fc>] ? __lock_acquire+0x33c/0x1250
[ 3715.150260]  [<ffffffff80219be0>] ? native_sched_clock+0x20/0x80
[ 3715.150262]  [<ffffffff80219c49>] ? sched_clock+0x9/0x10
[ 3715.150265]  [<ffffffff803253b9>] ? aio_run_iocb+0x49/0x160
[ 3715.150267]  [<ffffffff806a98fb>] ? _spin_unlock_irq+0x2b/0x40
[ 3715.150269]  [<ffffffff802b9790>] ? generic_file_aio_read+0x0/0x6c0
[ 3715.150272]  [<ffffffff80323c1c>] aio_rw_vect_retry+0x7c/0x210
[ 3715.150274]  [<ffffffff80323ba0>] ? aio_rw_vect_retry+0x0/0x210
[ 3715.150276]  [<ffffffff803253fe>] aio_run_iocb+0x8e/0x160
[ 3715.150278]  [<ffffffff80325fa6>] sys_io_submit+0x286/0x700
[ 3715.150281]  [<ffffffff802124aa>] system_call_fastpath+0x16/0x1b
[ 3715.150283] no locks held by fio/27967.

Looking at a dump of all tasks, I don't see a thread labeled "ib_srp",
but here are some of the other threads involved, everybody seems to
just be waiting:

All the AIO threads are waiting, as in:

[43440.068513] aio/7         S ffff88042f855ee0     0    88      2
[43440.068515]  ffff880423c25ec0 0000000000000046 ffff880423c2a1f0
ffff880423c2a1f0
[43440.068518]  ffffffff80267ed9 ffff880423c2a1f0 ffff88042e93c3e0
ffff880423c2a560
[43440.068521]  0000000723c25e60 00000000ffff8d2a ffff880423c2a560
0000000000000282
[43440.068524] Call Trace:
[43440.068526]  [<ffffffff80267ed9>] ? prepare_to_wait+0x49/0x80
[43440.068528]  [<ffffffff8027a3fd>] ? trace_hardirqs_on+0xd/0x10
[43440.068530]  [<ffffffff802635cd>] worker_thread+0xed/0x130
[43440.068532]  [<ffffffff80267c60>] ? autoremove_wake_function+0x0/0x40
[43440.068533]  [<ffffffff802634e0>] ? worker_thread+0x0/0x130
[43440.068535]  [<ffffffff80267839>] kthread+0x49/0x90
[43440.068537]  [<ffffffff80213b19>] child_rip+0xa/0x11
[43440.068539]  [<ffffffff80212d23>] ? restore_args+0x0/0x30
[43440.068541]  [<ffffffff802677f0>] ? kthread+0x0/0x90
[43440.068543]  [<ffffffff80213b0f>] ? child_rip+0x0/0x11

One scsi_eh thread is in a "kobject_put" state:

[43440.068580] scsi_eh_0     S ffff880423d59ed0     0    92      2
[43440.068582]  ffff880423d59e60 0000000000000046 ffff880423d59dd0
ffffffff80422037
[43440.068585]  ffff8804243b4080 ffff880423c743e0 ffff88042e808000
ffff880423c74750
[43440.068588]  0000000223d59e00 00000000ffff945b ffff880423c74750
0000000000000000
[43440.068591] Call Trace:
[43440.068593]  [<ffffffff80422037>] ? kobject_put+0x27/0x60
[43440.068596]  [<ffffffff804d94d7>] ? __scsi_iterate_devices+0x77/0xa0
[43440.068599]  [<ffffffff804deb0c>] scsi_error_handler+0x8c/0x5d0
[43440.068601]  [<ffffffff8023e8ab>] ? complete+0x4b/0x60
[43440.068603]  [<ffffffff804dea80>] ? scsi_error_handler+0x0/0x5d0
[43440.068605]  [<ffffffff80267839>] kthread+0x49/0x90
[43440.068607]  [<ffffffff80213b19>] child_rip+0xa/0x11
[43440.068609]  [<ffffffff80212d23>] ? restore_args+0x0/0x30
[43440.068611]  [<ffffffff802677f0>] ? kthread+0x0/0x90
[43440.068613]  [<ffffffff80213b0f>] ? child_rip+0x0/0x11

... all the rest are in trace_hardirqs_on_caller (except for one
other, shown later):

[43440.068614] scsi_eh_1     S ffff880423d5bed0     0    93      2
[43440.068616]  ffff880423d5be60 0000000000000046 ffff880423d5bdd0
ffffffff8027a38b
[43440.068619]  ffff8804243b5080 ffff880423c721f0 ffff88042e8fa1f0
ffff880423c72560
[43440.068622]  0000000623d5be00 00000000ffff8d39 ffff880423c72560
0000000000000000
[43440.068625] Call Trace:
[43440.068626]  [<ffffffff8027a38b>] ? trace_hardirqs_on_caller+0x13b/0x1a0
[43440.068628]  [<ffffffff804d94ca>] ? __scsi_iterate_devices+0x6a/0xa0
[43440.068631]  [<ffffffff804deb0c>] scsi_error_handler+0x8c/0x5d0
[43440.068632]  [<ffffffff8023e8ab>] ? complete+0x4b/0x60
[43440.068634]  [<ffffffff804dea80>] ? scsi_error_handler+0x0/0x5d0
[43440.068636]  [<ffffffff80267839>] kthread+0x49/0x90
[43440.068638]  [<ffffffff80213b19>] child_rip+0xa/0x11
[43440.068640]  [<ffffffff80212d23>] ? restore_args+0x0/0x30
[43440.068642]  [<ffffffff802677f0>] ? kthread+0x0/0x90
[43440.068644]  [<ffffffff80213b0f>] ? child_rip+0x0/0x11

The IB stack is in the following state:

[43440.071112] mthca_catas   S ffff88042f855ee0     0  2805      2
[43440.071114]  ffff88042309bec0 0000000000000046 ffff880428c1a1f0
ffff880428c1a1f0
[43440.071117]  ffffffff80267ed9 ffff880428c1a1f0 ffff88042e8fa1f0
ffff880428c1a560
[43440.071120]  000000062309be60 00000000ffff96d1 ffff880428c1a560
0000000000000282
[43440.071123] Call Trace:
[43440.071125]  [<ffffffff80267ed9>] ? prepare_to_wait+0x49/0x80
[43440.071127]  [<ffffffff8027a3fd>] ? trace_hardirqs_on+0xd/0x10
[43440.071129]  [<ffffffff802635cd>] worker_thread+0xed/0x130
[43440.071131]  [<ffffffff80267c60>] ? autoremove_wake_function+0x0/0x40
[43440.071133]  [<ffffffff802634e0>] ? worker_thread+0x0/0x130
[43440.071135]  [<ffffffff80267839>] kthread+0x49/0x90
[43440.071136]  [<ffffffff80213b19>] child_rip+0xa/0x11
[43440.071138]  [<ffffffff80212d23>] ? restore_args+0x0/0x30
[43440.071140]  [<ffffffff802677f0>] ? kthread+0x0/0x90
[43440.071142]  [<ffffffff80213b0f>] ? child_rip+0x0/0x11
[43440.071143] mlx4_err      S ffff88042f855ee0     0  2808      2
[43440.071146]  ffff88042d4ebec0 0000000000000046 ffff880428c1c3e0
ffff880428c1c3e0
[43440.071149]  ffffffff80267ed9 ffff880428c1c3e0 ffff880428c18000
ffff880428c1c750
[43440.071152]  000000032d4ebe60 ffffffff8027a38b ffff880428c1c750
0000000000000282
[43440.071155] Call Trace:
[43440.071157]  [<ffffffff80267ed9>] ? prepare_to_wait+0x49/0x80
[43440.071159]  [<ffffffff8027a38b>] ? trace_hardirqs_on_caller+0x13b/0x1a0
[43440.071160]  [<ffffffff8027a3fd>] ? trace_hardirqs_on+0xd/0x10
[43440.071162]  [<ffffffff802635cd>] worker_thread+0xed/0x130
[43440.071164]  [<ffffffff80267c60>] ? autoremove_wake_function+0x0/0x40
[43440.071166]  [<ffffffff802634e0>] ? worker_thread+0x0/0x130
[43440.071168]  [<ffffffff80267839>] kthread+0x49/0x90
[43440.071170]  [<ffffffff80213b19>] child_rip+0xa/0x11
[43440.071172]  [<ffffffff80212d23>] ? restore_args+0x0/0x30
[43440.071173]  [<ffffffff802677f0>] ? kthread+0x0/0x90
[43440.071175]  [<ffffffff80213b0f>] ? child_rip+0x0/0x11
[43440.071177] ib_mad1       S ffff88042f855ee0     0  2815      2
[43440.071179]  ffff88042c4f3ec0 0000000000000046 ffff88042794c3e0
ffff88042794c3e0
[43440.071182]  ffffffff80267ed9 ffff88042794c3e0 ffff88042efc43e0
ffff88042794c750
[43440.071185]  000000012c4f3e60 000000010041d148 ffff88042794c750
0000000000000282
[43440.071188] Call Trace:
[43440.071190]  [<ffffffff80267ed9>] ? prepare_to_wait+0x49/0x80
[43440.071191]  [<ffffffff8027a3fd>] ? trace_hardirqs_on+0xd/0x10
[43440.071193]  [<ffffffff802635cd>] worker_thread+0xed/0x130
[43440.071195]  [<ffffffff80267c60>] ? autoremove_wake_function+0x0/0x40
[43440.071197]  [<ffffffff802634e0>] ? worker_thread+0x0/0x130
[43440.071199]  [<ffffffff80267839>] kthread+0x49/0x90
[43440.071201]  [<ffffffff80213b19>] child_rip+0xa/0x11
[43440.071202]  [<ffffffff80212d23>] ? restore_args+0x0/0x30
[43440.071204]  [<ffffffff802677f0>] ? kthread+0x0/0x90
[43440.071206]  [<ffffffff80213b0f>] ? child_rip+0x0/0x11
[43440.071208] ib_mad2       S ffff88042f855ee0     0  2816      2
[43440.071210]  ffff880422431ec0 0000000000000046 ffff88042c5343e0
ffff88042c5343e0
[43440.071213]  ffffffff80267ed9 ffff88042c5343e0 ffff8804283fa1f0
ffff88042c534750
[43440.071216]  0000000622431e60 ffffffff8027a38b ffff88042c534750
0000000000000282
[43440.071219] Call Trace:
[43440.071221]  [<ffffffff80267ed9>] ? prepare_to_wait+0x49/0x80
[43440.071223]  [<ffffffff8027a38b>] ? trace_hardirqs_on_caller+0x13b/0x1a0
[43440.071224]  [<ffffffff8027a3fd>] ? trace_hardirqs_on+0xd/0x10
[43440.071226]  [<ffffffff802635cd>] worker_thread+0xed/0x130
[43440.071228]  [<ffffffff80267c60>] ? autoremove_wake_function+0x0/0x40
[43440.071230]  [<ffffffff802634e0>] ? worker_thread+0x0/0x130
[43440.071232]  [<ffffffff80267839>] kthread+0x49/0x90
[43440.071234]  [<ffffffff80213b19>] child_rip+0xa/0x11
[43440.071235]  [<ffffffff80212d23>] ? restore_args+0x0/0x30
[43440.071237]  [<ffffffff802677f0>] ? kthread+0x0/0x90
[43440.071239]  [<ffffffff80213b0f>] ? child_rip+0x0/0x11
[43440.071241] ib_mcast      S ffff88042f855ee0     0  2837      2
[43440.071243]  ffff88042b783ec0 0000000000000046 ffff88042b6e0000
ffff88042b6e0000
[43440.071246]  ffffffff80267ed9 ffff88042b6e0000 ffff88042e8fa1f0
ffff88042b6e0370
[43440.071249]  000000062b783e60 00000000ffff97c1 ffff88042b6e0370
0000000000000282
[43440.071252] Call Trace:
[43440.071254]  [<ffffffff80267ed9>] ? prepare_to_wait+0x49/0x80
[43440.071256]  [<ffffffff8027a3fd>] ? trace_hardirqs_on+0xd/0x10
[43440.071257]  [<ffffffff802635cd>] worker_thread+0xed/0x130
[43440.071259]  [<ffffffff80267c60>] ? autoremove_wake_function+0x0/0x40
[43440.071261]  [<ffffffff802634e0>] ? worker_thread+0x0/0x130
[43440.071263]  [<ffffffff80267839>] kthread+0x49/0x90
[43440.071265]  [<ffffffff80213b19>] child_rip+0xa/0x11
[43440.071267]  [<ffffffff80212d23>] ? restore_args+0x0/0x30
[43440.071269]  [<ffffffff802677f0>] ? kthread+0x0/0x90
[43440.071270]  [<ffffffff80213b0f>] ? child_rip+0x0/0x11
[43440.071272] ib_cm/0       S ffff88042f855ee0     0  2840      2
[43440.071274]  ffff88042b79fec0 0000000000000046 ffff88042c5321f0
ffff88042c5321f0
[43440.071277]  ffffffff80267ed9 ffff88042c5321f0 ffff8803c44343e0
ffff88042c532560
[43440.071280]  000000002b79fe60 ffffffff8027a38b ffff88042c532560
0000000000000282
[43440.071283] Call Trace:
[43440.071285]  [<ffffffff80267ed9>] ? prepare_to_wait+0x49/0x80
[43440.071287]  [<ffffffff8027a38b>] ? trace_hardirqs_on_caller+0x13b/0x1a0
[43440.071288]  [<ffffffff8027a3fd>] ? trace_hardirqs_on+0xd/0x10
[43440.071290]  [<ffffffff802635cd>] worker_thread+0xed/0x130
[43440.071292]  [<ffffffff80267c60>] ? autoremove_wake_function+0x0/0x40
[43440.071294]  [<ffffffff802634e0>] ? worker_thread+0x0/0x130
[43440.071296]  [<ffffffff80267839>] kthread+0x49/0x90
[43440.071298]  [<ffffffff80213b19>] child_rip+0xa/0x11
[43440.071299]  [<ffffffff80212d23>] ? restore_args+0x0/0x30
[43440.071301]  [<ffffffff802677f0>] ? kthread+0x0/0x90
[43440.071303]  [<ffffffff80213b0f>] ? child_rip+0x0/0x11
[43440.071305] ib_cm/1       S ffff88042f855ee0     0  2841      2
[43440.071307]  ffff88042b4d5ec0 0000000000000046 ffff88042b4d8000
ffff88042b4d8000
[43440.071310]  ffffffff80267ed9 ffff88042b4d8000 ffff88042efc43e0
ffff88042b4d8370
[43440.071313]  000000012b4d5e60 000000010002d3a4 ffff88042b4d8370
0000000000000282
[43440.071316] Call Trace:
[43440.071318]  [<ffffffff80267ed9>] ? prepare_to_wait+0x49/0x80
[43440.071320]  [<ffffffff8027a3fd>] ? trace_hardirqs_on+0xd/0x10
[43440.071322]  [<ffffffff802635cd>] worker_thread+0xed/0x130
[43440.071324]  [<ffffffff80267c60>] ? autoremove_wake_function+0x0/0x40
[43440.071325]  [<ffffffff802634e0>] ? worker_thread+0x0/0x130
[43440.071327]  [<ffffffff80267839>] kthread+0x49/0x90
[43440.071329]  [<ffffffff80213b19>] child_rip+0xa/0x11
[43440.071331]  [<ffffffff80212d23>] ? restore_args+0x0/0x30
[43440.071333]  [<ffffffff802677f0>] ? kthread+0x0/0x90
[43440.071334]  [<ffffffff80213b0f>] ? child_rip+0x0/0x11
[43440.071336] ib_cm/2       S ffff88042f855ee0     0  2842      2
[43440.071338]  ffff88042b4d7ec0 0000000000000046 ffff88042b4da1f0
ffff88042b4da1f0
[43440.071341]  ffffffff80267ed9 ffff88042b4da1f0 ffff88042e808000
ffff88042b4da560
[43440.071344]  000000022b4d7e60 00000000ffff97c2 ffff88042b4da560
0000000000000282
[43440.071347] Call Trace:
[43440.071349]  [<ffffffff80267ed9>] ? prepare_to_wait+0x49/0x80
[43440.071351]  [<ffffffff8027a3fd>] ? trace_hardirqs_on+0xd/0x10
[43440.071353]  [<ffffffff802635cd>] worker_thread+0xed/0x130
[43440.071355]  [<ffffffff80267c60>] ? autoremove_wake_function+0x0/0x40
[43440.071356]  [<ffffffff802634e0>] ? worker_thread+0x0/0x130
[43440.071358]  [<ffffffff80267839>] kthread+0x49/0x90
[43440.071360]  [<ffffffff80213b19>] child_rip+0xa/0x11
[43440.071362]  [<ffffffff80212d23>] ? restore_args+0x0/0x30
[43440.071364]  [<ffffffff802677f0>] ? kthread+0x0/0x90
[43440.071365]  [<ffffffff80213b0f>] ? child_rip+0x0/0x11
[43440.071367] ib_cm/3       S ffff88042f855ee0     0  2843      2
[43440.071369]  ffff88042b559ec0 0000000000000046 ffff88042b4dc3e0
ffff88042b4dc3e0
[43440.071372]  ffffffff80267ed9 ffff88042b4dc3e0 ffff88042e83a1f0
ffff88042b4dc750
[43440.071375]  000000032b559e60 00000000ffff97c2 ffff88042b4dc750
0000000000000282
[43440.071378] Call Trace:
[43440.071380]  [<ffffffff80267ed9>] ? prepare_to_wait+0x49/0x80
[43440.071382]  [<ffffffff8027a3fd>] ? trace_hardirqs_on+0xd/0x10
[43440.071384]  [<ffffffff802635cd>] worker_thread+0xed/0x130
[43440.071386]  [<ffffffff80267c60>] ? autoremove_wake_function+0x0/0x40
[43440.071388]  [<ffffffff802634e0>] ? worker_thread+0x0/0x130
[43440.071390]  [<ffffffff80267839>] kthread+0x49/0x90
[43440.071391]  [<ffffffff80213b19>] child_rip+0xa/0x11
[43440.071393]  [<ffffffff80212d23>] ? restore_args+0x0/0x30
[43440.071395]  [<ffffffff802677f0>] ? kthread+0x0/0x90
[43440.071397]  [<ffffffff80213b0f>] ? child_rip+0x0/0x11
[43440.071398] ib_cm/4       S ffff88042f855ee0     0  2844      2
[43440.071401]  ffff88042b55bec0 0000000000000046 ffff88042b550000
ffff88042b550000
[43440.071404]  ffffffff80267ed9 ffff88042b550000 ffff88042e87c3e0
ffff88042b550370
[43440.071407]  000000042b55be60 00000000ffff97c2 ffff88042b550370
0000000000000282
[43440.071410] Call Trace:
[43440.071411]  [<ffffffff80267ed9>] ? prepare_to_wait+0x49/0x80
[43440.071413]  [<ffffffff8027a3fd>] ? trace_hardirqs_on+0xd/0x10
[43440.071415]  [<ffffffff802635cd>] worker_thread+0xed/0x130
[43440.071417]  [<ffffffff80267c60>] ? autoremove_wake_function+0x0/0x40
[43440.071419]  [<ffffffff802634e0>] ? worker_thread+0x0/0x130
[43440.071421]  [<ffffffff80267839>] kthread+0x49/0x90
[43440.071423]  [<ffffffff80213b19>] child_rip+0xa/0x11
[43440.071424]  [<ffffffff80212d23>] ? restore_args+0x0/0x30
[43440.071426]  [<ffffffff802677f0>] ? kthread+0x0/0x90
[43440.071428]  [<ffffffff80213b0f>] ? child_rip+0x0/0x11
[43440.071429] ib_cm/5       S ffff88042f855ee0     0  2845      2
[43440.071432]  ffff88042b541ec0 0000000000000046 ffff88042b5521f0
ffff88042b5521f0
[43440.071435]  ffffffff80267ed9 ffff88042b5521f0 ffff88042e8c0000
ffff88042b552560
[43440.071438]  000000052b541e60 000000010004ef98 ffff88042b552560
0000000000000282
[43440.071440] Call Trace:
[43440.071442]  [<ffffffff80267ed9>] ? prepare_to_wait+0x49/0x80
[43440.071444]  [<ffffffff8027a3fd>] ? trace_hardirqs_on+0xd/0x10
[43440.071446]  [<ffffffff802635cd>] worker_thread+0xed/0x130
[43440.071448]  [<ffffffff80267c60>] ? autoremove_wake_function+0x0/0x40
[43440.071450]  [<ffffffff802634e0>] ? worker_thread+0x0/0x130
[43440.071452]  [<ffffffff80267839>] kthread+0x49/0x90
[43440.071453]  [<ffffffff80213b19>] child_rip+0xa/0x11
[43440.071455]  [<ffffffff80212d23>] ? restore_args+0x0/0x30
[43440.071457]  [<ffffffff802677f0>] ? kthread+0x0/0x90
[43440.071459]  [<ffffffff80213b0f>] ? child_rip+0x0/0x11
[43440.071460] ib_cm/6       S ffff88042f855ee0     0  2846      2
[43440.071462]  ffff88042b543ec0 0000000000000046 ffff88042b5543e0
ffff88042b5543e0
[43440.071465]  ffffffff80267ed9 ffff88042b5543e0 ffff88042e8fa1f0
ffff88042b554750
[43440.071468]  000000062b543e60 00000000ffff97c2 ffff88042b554750
0000000000000282
[43440.071471] Call Trace:
[43440.071472]  [<ffffffff80267ed9>] ? prepare_to_wait+0x49/0x80
[43440.071474]  [<ffffffff8027a3fd>] ? trace_hardirqs_on+0xd/0x10
[43440.071476]  [<ffffffff802635cd>] worker_thread+0xed/0x130
[43440.071478]  [<ffffffff80267c60>] ? autoremove_wake_function+0x0/0x40
[43440.071480]  [<ffffffff802634e0>] ? worker_thread+0x0/0x130
[43440.071482]  [<ffffffff80267839>] kthread+0x49/0x90
[43440.071484]  [<ffffffff80213b19>] child_rip+0xa/0x11
[43440.071486]  [<ffffffff80212d23>] ? restore_args+0x0/0x30
[43440.071487]  [<ffffffff802677f0>] ? kthread+0x0/0x90
[43440.071489]  [<ffffffff80213b0f>] ? child_rip+0x0/0x11
[43440.071490] ib_cm/7       S ffff88042f855ee0     0  2847      2
[43440.071493]  ffff88042b545ec0 0000000000000046 ffff88042b530000
ffff88042b530000
[43440.071496]  ffffffff80267ed9 ffff88042b530000 ffff88042e93c3e0
ffff88042b530370
[43440.071499]  000000072b545e60 00000000ffff97c2 ffff88042b530370
0000000000000282
[43440.071502] Call Trace:
[43440.071504]  [<ffffffff80267ed9>] ? prepare_to_wait+0x49/0x80
[43440.071505]  [<ffffffff8027a3fd>] ? trace_hardirqs_on+0xd/0x10
[43440.071507]  [<ffffffff802635cd>] worker_thread+0xed/0x130
[43440.071509]  [<ffffffff80267c60>] ? autoremove_wake_function+0x0/0x40
[43440.071511]  [<ffffffff802634e0>] ? worker_thread+0x0/0x130
[43440.071513]  [<ffffffff80267839>] kthread+0x49/0x90
[43440.071515]  [<ffffffff80213b19>] child_rip+0xa/0x11
[43440.071517]  [<ffffffff80212d23>] ? restore_args+0x0/0x30
[43440.071519]  [<ffffffff802677f0>] ? kthread+0x0/0x90
[43440.071520]  [<ffffffff80213b0f>] ? child_rip+0x0/0x11
[43440.071522] ipoib         S ffff88042f855ee0     0  2853      2
[43440.071524]  ffff88042b5f9ec0 0000000000000046 ffff88042b5321f0
ffff88042b5321f0
[43440.071527]  ffffffff80267ed9 ffff88042b5321f0 ffff88042e988000
ffff88042b532560
[43440.071530]  000000032b5f9e60 ffffffff8027a38b ffff88042b532560
0000000000000282
[43440.071533] Call Trace:
[43440.071535]  [<ffffffff80267ed9>] ? prepare_to_wait+0x49/0x80
[43440.071537]  [<ffffffff8027a38b>] ? trace_hardirqs_on_caller+0x13b/0x1a0
[43440.071539]  [<ffffffff8027a3fd>] ? trace_hardirqs_on+0xd/0x10
[43440.071541]  [<ffffffff802635cd>] worker_thread+0xed/0x130
[43440.071543]  [<ffffffff80267c60>] ? autoremove_wake_function+0x0/0x40
[43440.071544]  [<ffffffff802634e0>] ? worker_thread+0x0/0x130
[43440.071546]  [<ffffffff80267839>] kthread+0x49/0x90
[43440.071548]  [<ffffffff80213b19>] child_rip+0xa/0x11
[43440.071550]  [<ffffffff80212d23>] ? restore_args+0x0/0x30
[43440.071552]  [<ffffffff802677f0>] ? kthread+0x0/0x90
[43440.071553]  [<ffffffff80213b0f>] ? child_rip+0x0/0x11
[43440.071555] ib_addr       S ffff88042f855ee0     0  2870      2
[43440.071557]  ffff88042ace9ec0 0000000000000046 ffff88042b5343e0
ffff88042b5343e0
[43440.071560]  ffffffff80267ed9 ffff88042b5343e0 ffff88042e8c0000
ffff88042b534750
[43440.071563]  000000052ace9e60 0000000100419d43 ffff88042b534750
0000000000000282
[43440.071566] Call Trace:
[43440.071568]  [<ffffffff80267ed9>] ? prepare_to_wait+0x49/0x80
[43440.071570]  [<ffffffff8027a3fd>] ? trace_hardirqs_on+0xd/0x10
[43440.071572]  [<ffffffff802635cd>] worker_thread+0xed/0x130
[43440.071574]  [<ffffffff80267c60>] ? autoremove_wake_function+0x0/0x40
[43440.071575]  [<ffffffff802634e0>] ? worker_thread+0x0/0x130
[43440.071577]  [<ffffffff80267839>] kthread+0x49/0x90
[43440.071579]  [<ffffffff80213b19>] child_rip+0xa/0x11
[43440.071581]  [<ffffffff80212d23>] ? restore_args+0x0/0x30
[43440.071583]  [<ffffffff802677f0>] ? kthread+0x0/0x90
[43440.071585]  [<ffffffff80213b0f>] ? child_rip+0x0/0x11
[43440.071586] iw_cm_wq      S ffff88042f855ee0     0  2872      2
[43440.071589]  ffff880428373ec0 0000000000000046 ffff88042d8b21f0
ffff88042d8b21f0
[43440.071592]  ffffffff80267ed9 ffff88042d8b21f0 ffff880423130000
ffff88042d8b2560
[43440.071595]  0000000328373e60 ffffffff8027a38b ffff88042d8b2560
0000000000000282
[43440.071598] Call Trace:
[43440.071599]  [<ffffffff80267ed9>] ? prepare_to_wait+0x49/0x80
[43440.071601]  [<ffffffff8027a38b>] ? trace_hardirqs_on_caller+0x13b/0x1a0
[43440.071603]  [<ffffffff8027a3fd>] ? trace_hardirqs_on+0xd/0x10
[43440.071605]  [<ffffffff802635cd>] worker_thread+0xed/0x130
[43440.071607]  [<ffffffff80267c60>] ? autoremove_wake_function+0x0/0x40
[43440.071609]  [<ffffffff802634e0>] ? worker_thread+0x0/0x130
[43440.071611]  [<ffffffff80267839>] kthread+0x49/0x90
[43440.071613]  [<ffffffff80213b19>] child_rip+0xa/0x11
[43440.071614]  [<ffffffff80212d23>] ? restore_args+0x0/0x30
[43440.071616]  [<ffffffff802677f0>] ? kthread+0x0/0x90
[43440.071618]  [<ffffffff80213b0f>] ? child_rip+0x0/0x11
[43440.071619] rdma_cm       S ffff88042f855ee0     0  2874      2
[43440.071622]  ffff880429917ec0 0000000000000046 ffff88042c4ec3e0
ffff88042c4ec3e0
[43440.071625]  ffffffff80267ed9 ffff88042c4ec3e0 ffff880428c343e0
ffff88042c4ec750
[43440.071628]  0000000629917e60 ffffffff8027a38b ffff88042c4ec750
0000000000000282
[43440.071631] Call Trace:
[43440.071632]  [<ffffffff80267ed9>] ? prepare_to_wait+0x49/0x80
[43440.071634]  [<ffffffff8027a38b>] ? trace_hardirqs_on_caller+0x13b/0x1a0
[43440.071636]  [<ffffffff8027a3fd>] ? trace_hardirqs_on+0xd/0x10
[43440.071638]  [<ffffffff802635cd>] worker_thread+0xed/0x130
[43440.071640]  [<ffffffff80267c60>] ? autoremove_wake_function+0x0/0x40
[43440.071642]  [<ffffffff802634e0>] ? worker_thread+0x0/0x130
[43440.071644]  [<ffffffff80267839>] kthread+0x49/0x90
[43440.071646]  [<ffffffff80213b19>] child_rip+0xa/0x11
[43440.071647]  [<ffffffff80212d23>] ? restore_args+0x0/0x30
[43440.071649]  [<ffffffff802677f0>] ? kthread+0x0/0x90
[43440.071651]  [<ffffffff80213b0f>] ? child_rip+0x0/0x11

scsi_tgtd threads all look like:

[43440.075668] scsi_tgtd/7   S ffff88042f855ee0     0 27784      2
[43440.075668]  ffff880427057ec0 0000000000000046 ffff880427f043e0
ffff880427f043e0
[43440.075668]  ffffffff80267ed9 ffff880427f043e0 ffff88042e93c3e0
ffff880427f04750
[43440.075668]  0000000727057e60 000000010002c6f4 ffff880427f04750
0000000000000282
[43440.075668] Call Trace:
[43440.075668]  [<ffffffff80267ed9>] ? prepare_to_wait+0x49/0x80
[43440.075668]  [<ffffffff8027a3fd>] ? trace_hardirqs_on+0xd/0x10
[43440.075668]  [<ffffffff802635cd>] worker_thread+0xed/0x130
[43440.075668]  [<ffffffff80267c60>] ? autoremove_wake_function+0x0/0x40
[43440.075668]  [<ffffffff802634e0>] ? worker_thread+0x0/0x130
[43440.075668]  [<ffffffff80267839>] kthread+0x49/0x90
[43440.075668]  [<ffffffff80213b19>] child_rip+0xa/0x11
[43440.075668]  [<ffffffff80212d23>] ? restore_args+0x0/0x30
[43440.075668]  [<ffffffff802677f0>] ? kthread+0x0/0x90
[43440.075668]  [<ffffffff80213b0f>] ? child_rip+0x0/0x11

Not sure what this thread is:

[43440.075668] ib_fmr(mlx4_0 S ffff88042f855ee0     0 27790      2
[43440.075668]  ffff8803c18a9ef0 0000000000000046 ffffffff80bf8a00
0000000000000000
[43440.075668]  ffff880423eff080 ffff880427e9a1f0 ffff88042e8fa1f0
ffff880427e9a560
[43440.075668]  0000000627e9a1f0 000000010002c6f7 ffff880427e9a560
ffff880427e9a1f0
[43440.075668] Call Trace:
[43440.075668]  [<ffffffffa010bfa5>] ib_fmr_cleanup_thread+0xb5/0xd0 [ib_core]
[43440.075668]  [<ffffffffa010bef0>] ? ib_fmr_cleanup_thread+0x0/0xd0 [ib_core]
[43440.075668]  [<ffffffff80267839>] kthread+0x49/0x90
[43440.075668]  [<ffffffff80213b19>] child_rip+0xa/0x11
[43440.075668]  [<ffffffff80212d23>] ? restore_args+0x0/0x30
[43440.075668]  [<ffffffff802677f0>] ? kthread+0x0/0x90
[43440.075668]  [<ffffffff80213b0f>] ? child_rip+0x0/0x11

Another scsi_eh, in a different state:

[43440.075668] scsi_eh_4     S ffff88042713bed0     0 27831      2
[43440.075668]  ffff88042713be60 0000000000000046 ffffffff80244f0f
ffff880427e9c3e0
[43440.075668]  ffffffff806a98fb ffff880427e9c3e0 ffff88042e83a1f0
ffff880427e9c750
[43440.075668]  0000000344840a00 000000010002d3a4 ffff880427e9c750
ffffffff8027a3fd
[43440.075668] Call Trace:
[43440.075668]  [<ffffffff80244f0f>] ? finish_task_switch+0x5f/0x120
[43440.075668]  [<ffffffff806a98fb>] ? _spin_unlock_irq+0x2b/0x40
[43440.075668]  [<ffffffff8027a3fd>] ? trace_hardirqs_on+0xd/0x10
[43440.075668]  [<ffffffff806a98fb>] ? _spin_unlock_irq+0x2b/0x40
[43440.075668]  [<ffffffff80244f0f>] ? finish_task_switch+0x5f/0x120
[43440.075668]  [<ffffffff80244eb0>] ? finish_task_switch+0x0/0x120
[43440.075668]  [<ffffffff804deb0c>] scsi_error_handler+0x8c/0x5d0
[43440.075668]  [<ffffffff8023e8ab>] ? complete+0x4b/0x60
[43440.075668]  [<ffffffff804dea80>] ? scsi_error_handler+0x0/0x5d0
[43440.075668]  [<ffffffff80267839>] kthread+0x49/0x90
[43440.075668]  [<ffffffff80213b19>] child_rip+0xa/0x11
[43440.075668]  [<ffffffff80212d23>] ? restore_args+0x0/0x30
[43440.075668]  [<ffffffff802677f0>] ? kthread+0x0/0x90
[43440.075668]  [<ffffffff80213b0f>] ? child_rip+0x0/0x11

While I appreciate you and Vlad's help immensely, from what I gather
(and I could be wrong), neither of you work on ib_srp.  Is there
someone from the initiator side we could bring-in on this?

Some clarifications on my testing procedures: in testing block sizes,
I usually start w/ 1MB blocks, then decrease the block size to lesser
powers of two, finishing  with 512B blocks... so when I say "the
problem occurs w/ the RHEL/OFED target at <2KB blocks", that means it
made it through the 8KB blocks w/o hanging (where the Ubuntu hangs).

While being able to remove the scst/scst_vdisk/ib_srpt modules from
the target once the hang has occurred, and not being able to remove
ib_srp on the initiator, seems to implicate the initiator, it could be
that the target is in a state in the protocol where it believes
nothing is outstanding, while the initiator is waiting in another
state, for more from the target.  If that is the case, then it's hard
to say who's to blame.

Thanks,

Chris



More information about the general mailing list