[ofa-general] SRP/mlx4 interrupts throttling performance

Cameron Harr cameron at harr.org
Fri Oct 24 12:38:32 PDT 2008


Vladislav Bolkhovitin wrote:
>> ** Sometimes the benchmark "zombied" (process doing no work, but 
>> process can't be killed) after running a certain amount of time. 
>> However, it wasn't repeatable in a reliable way, so I mark that this 
>> particular run has zombied before.
>
> That means that there is a bug somewhere. Usually such bugs are found 
> in a few hours of code auditing (the srpt driver is pretty simple) or by 
> using kernel debug facilities (example diff to .config attached). I 
> personally always prefer to put my effort into fixing real things, not 
> inventing various workarounds, like srpt_thread in this case.
>
> So I would:
>
>   1. Completely remove the srpt thread and all related code. It doesn't do
> anything that can't be done in SIRQ context (tasklet).
>
>   2. Audit the code to check whether it does anything it shouldn't 
> do in SIRQ context, and fix it. This step isn't required, but it usually 
> saves a lot of time spent on puzzled debugging in the future.
>
>   3. In srpt_handle_rdma_comp() and srpt_handle_new_iu(), change
> SCST_CONTEXT_THREAD to SCST_CONTEXT_DIRECT_ATOMIC.

I also changed it in srpt_handle_err_comp().
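
For anyone following along who isn't familiar with the two contexts in (3), here is a minimal sketch of the difference as I understand it. Everything in it is a hypothetical stand-in (the enum, the struct and the function names are made up for illustration and are not the real SCST or ib_srpt API): a "thread" context hands the command to a worker for later, while a "direct atomic" context processes it inline in the caller, which may be an interrupt handler and so must not sleep.

```c
#include <stddef.h>

/* Hypothetical stand-ins for SCST_CONTEXT_THREAD and
 * SCST_CONTEXT_DIRECT_ATOMIC; not the real SCST definitions. */
enum exec_ctx { CTX_THREAD, CTX_DIRECT_ATOMIC };

struct cmd {
    int processed_inline;   /* 1 if handled in the caller's context */
    struct cmd *next;       /* link in the deferred-work list */
};

/* Toy list of commands waiting for a worker thread. */
struct cmd *deferred_head = NULL;

/* Toy model of a "command init done" entry point (assumed behaviour). */
void model_cmd_init_done(struct cmd *c, enum exec_ctx ctx)
{
    if (ctx == CTX_DIRECT_ATOMIC) {
        /* Processed right here, e.g. inside the completion handler. */
        c->processed_inline = 1;
    } else {
        /* Deferred: a worker thread would pick it up later. */
        c->processed_inline = 0;
        c->next = deferred_head;
        deferred_head = c;
    }
}

/* Number of commands still waiting for a worker. */
int deferred_count(void)
{
    int n = 0;
    for (struct cmd *c = deferred_head; c != NULL; c = c->next)
        n++;
    return n;
}
```

With the direct variant nothing is queued, so any lock the processing path takes is taken in whatever context the caller happens to be running in.
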
>
> Then I would run the problematic tests (e.g. a heavy TPC-H workload) on 
> a debug kernel and fix the problems found.
>
> Anyway, Cameron, can you get the latest code from SCST trunk and try 
> with it? It was recently updated. Also please add the case with 
> changes from (3) above.
This is all with version 1.0.1 of SCST (v532).
In my fio test, I do runs with srpt thread=1 and then thread=0. With 
thread=0, fio printed many errors during the test and the target 
eventually crashed. This is the first part of a long call trace:

NMI Watchdog detected LOCKUP on CPU 0
CPU 0
Modules linked in: ib_srpt(U) scst_vdisk(U) scst(U) fio_driver(PU) 
fio_port(PU) autofs4 hidp rfcomm l2cap bluetooth sunrpc ib_ipoib mlx4_ib 
ib_cm ib_sa ib_mad ib_core ipv6 xfrm_nalgo crypto_api nls_utf8 hfsplus 
dm_mirror dm_multipath dm_mod video sbs backlight i2c_ec button battery 
asus_acpi acpi_memhotplug ac parport_pc lp parport i2c_i801 shpchp 
i2c_core e1000e mlx4_core i5000_edac edac_mc pcspkr ata_piix libata 
sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 25732, comm: scsi_tgt0 Tainted: P      2.6.18-92.1.13.el5 #1
RIP: 0010:[<ffffffff80064bcb>]  [<ffffffff80064bcb>] .text.lock.spinlock+0x29/0x30
RSP: 0018:ffffffff80418a88  EFLAGS: 00000086
RAX: ffff810785307fd8 RBX: ffffffff884e68a0 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffffff884e68a0
RBP: ffffffff884e62a0 R08: ffff810790926900 R09: ffff8107909268e8
R10: 0000000000000018 R11: ffffffff884fcab3 R12: 0000000000000001
R13: 0000000000000001 R14: 0000000000000000 R15: ffff8107f0f374c0
FS:  0000000000000000(0000) GS:ffffffff803a0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000037bc0986d0 CR3: 0000000000201000 CR4: 00000000000006e0
Process scsi_tgt0 (pid: 25732, threadinfo ffff810785306000, task ffff810810852100)
Stack:  0000000000000000 ffffffff884c509d ffff8107909268e8 ffff810790926900
 00000002071dd688 0000020000000220 0000000000000200 00000000da984c08
 0000000000000000 ffff8107909267f0 ffff810806ceee20 0000000000000001
Call Trace:
 <IRQ>  [<ffffffff884c509d>] :scst:sgv_pool_alloc+0x10c/0x5d3
 [<ffffffff884c1f85>] :scst:scst_alloc_space+0x5b/0x106
 [<ffffffff884bdc90>] :scst:scst_process_active_cmd+0x4fc/0x131c
 [<ffffffff884bee46>] :scst:scst_cmd_init_done+0x17f/0x3ef
 [<ffffffff884fb1ff>] :ib_srpt:srpt_handle_new_iu+0x281/0x4e7
 [<ffffffff8835ec3d>] :mlx4_ib:mlx4_ib_free_srq_wqe+0x27/0x4f
 [<ffffffff883591da>] :mlx4_ib:get_sw_cqe+0x12/0x30
 [<ffffffff88359c97>] :mlx4_ib:mlx4_ib_poll_cq+0x432/0x48f
 [<ffffffff884fcc43>] :ib_srpt:srpt_completion+0x190/0x250
 [<ffffffff8811aa5b>] :mlx4_core:mlx4_eq_int+0x3b/0x26f
 [<ffffffff8811ac9e>] :mlx4_core:mlx4_msi_x_interrupt+0xf/0x17
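
One possible reading of this trace, offered as an assumption rather than a diagnosis: the RIP is in the spinlock slow path, the call chain is inside <IRQ>, and the interrupted process is scsi_tgt0. If scsi_tgt0 was holding the same lock (taken without disabling interrupts) when the mlx4 interrupt arrived on CPU 0, the handler would spin forever on that CPU. The toy user-space model below shows that pattern with a try-lock; all of its names are hypothetical and none of them come from SCST.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy lock standing in for whatever lock sgv_pool_alloc() spins on. */
static atomic_flag pool_lock = ATOMIC_FLAG_INIT;

/* Returns true if the lock was free and is now held by the caller. */
bool model_trylock(void)
{
    return !atomic_flag_test_and_set(&pool_lock);
}

void model_unlock(void)
{
    atomic_flag_clear(&pool_lock);
}

/* "Interrupt handler" path. Returns false where a real spin_lock()
 * would keep spinning instead of returning. */
bool model_irq_path(void)
{
    if (!model_trylock())
        return false;
    model_unlock();
    return true;
}

/* "Thread" path: takes the lock, then the interrupt fires on the same
 * CPU before the lock is released. Returns whether the handler could
 * make progress. */
bool model_thread_interrupted_while_locked(void)
{
    bool got = model_trylock();
    bool irq_ok = model_irq_path();
    if (got)
        model_unlock();
    return irq_ok;
}
```

In the kernel, the usual remedy for this pattern is to take the lock with an interrupt-disabling variant such as spin_lock_irqsave() wherever it can also be taken from interrupt context. Whether that applies here would need checking against the SCST source.
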



