[Scst-devel] [ofa-general] SRP/mlx4 interrupts throttling performance
Vladislav Bolkhovitin
vst at vlnb.net
Tue Jan 13 10:08:03 PST 2009
Cameron Harr, on 01/13/2009 07:42 PM wrote:
> Vladislav Bolkhovitin wrote:
>> Cameron Harr, on 01/13/2009 02:56 AM wrote:
>>> Vladislav Bolkhovitin wrote:
>>>>>> I think srptthread=0 performs worse in this case because, with it,
>>>>>> part of the processing is done in SIRQ context, but the scheduler
>>>>>> seems to put that work on the same CPU as fct0-worker, which does
>>>>>> the data transfer to your SSD device. That thread already consumes
>>>>>> about 100% CPU, so it gets less CPU time, hence lower overall
>>>>>> performance.
>>>>>>
>>>>>> So, try to pin fctX-worker, the SCST threads and SIRQ processing to
>>>>>> different CPUs and check again. You can set thread affinity using
>>>>>> the utility from
>>>>>> http://www.kernel.org/pub/linux/kernel/people/rml/cpu-affinity/;
>>>>>> for IRQ affinity, see Documentation/IRQ-affinity.txt in your
>>>>>> kernel tree.
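
(As an illustration of the advice above, here is a minimal sketch of pinning
the current process to one CPU and routing an IRQ to others via procfs. The
IRQ number 123 and the CPU choices are placeholders, not values taken from
this thread; kernel threads such as fctX-worker would instead be pinned by
PID with the affinity utility or taskset.)

/* Sketch: pin the calling process to CPU 4 and route IRQ 123 to CPUs 1-3.
 * The IRQ number and CPU choices are placeholders for illustration only. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;

    /* Equivalent of "taskset -pc 4 <pid>": run this process on CPU 4 only. */
    CPU_ZERO(&set);
    CPU_SET(4, &set);
    if (sched_setaffinity(0, sizeof(set), &set))
        perror("sched_setaffinity");

    /* Equivalent of "echo 0e > /proc/irq/123/smp_affinity":
     * bitmask 0x0e = CPUs 1, 2 and 3 (see Documentation/IRQ-affinity.txt). */
    FILE *f = fopen("/proc/irq/123/smp_affinity", "w");
    if (f) {
        fprintf(f, "0e\n");
        fclose(f);
    } else {
        perror("fopen smp_affinity");
    }
    return 0;
}
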
>>> I ran with the two fct-worker threads pinned to cpus 7,8, the
>>> scsi_tgt threads pinned to cpus 4, 5 or 6 and irqbalance pinned on
>>> cpus 1-3. I wasn't sure if I should play with the 8 ksoftirqd procs,
>>> since there is one process per cpu. From these results, I don't see a
>>> big difference,
>> Hmm, you sent me the following results before:
>>
>> type=randwrite bs=4k drives=1 scst_threads=1 srptthread=1 iops=54934.31
>> type=randwrite bs=4k drives=1 scst_threads=1 srptthread=0 iops=50199.90
>> type=randwrite bs=4k drives=1 scst_threads=2 srptthread=1 iops=51510.68
>> type=randwrite bs=4k drives=1 scst_threads=2 srptthread=0 iops=49951.89
>> type=randwrite bs=4k drives=1 scst_threads=3 srptthread=1 iops=51924.17
>> type=randwrite bs=4k drives=1 scst_threads=3 srptthread=0 iops=49874.57
>> type=randwrite bs=4k drives=2 scst_threads=1 srptthread=1 iops=79680.42
>> type=randwrite bs=4k drives=2 scst_threads=1 srptthread=0 iops=74504.65
>> type=randwrite bs=4k drives=2 scst_threads=2 srptthread=1 iops=78558.77
>> type=randwrite bs=4k drives=2 scst_threads=2 srptthread=0 iops=75224.25
>> type=randwrite bs=4k drives=2 scst_threads=3 srptthread=1 iops=75411.52
>> type=randwrite bs=4k drives=2 scst_threads=3 srptthread=0 iops=73238.46
>>
>> I see quite a big improvement. For instance, for the drives=1
>> scst_threads=1 srptthread=1 case it is 36%. Or did you use different
>> hardware, so those results can't be compared?
> Vlad, you've got a good eye. Unfortunately, those results can't really
> be compared, because I believe the previous results were intentionally
> run in a worst-case performance scenario. However, I did run no-affinity
> runs before the affinity runs, and I would say the performance increase
> is variable and somewhat inconclusive:
>
> type=randwrite bs=4k drives=1 scst_threads=1 srptthread=1 iops=76724.08
> type=randwrite bs=4k drives=2 scst_threads=1 srptthread=1 iops=91318.28
> type=randwrite bs=4k drives=1 scst_threads=2 srptthread=1 iops=60374.94
> type=randwrite bs=4k drives=2 scst_threads=2 srptthread=1 iops=91618.18
> type=randwrite bs=4k drives=1 scst_threads=3 srptthread=1 iops=63076.21
> type=randwrite bs=4k drives=2 scst_threads=3 srptthread=1 iops=92251.24
> type=randwrite bs=4k drives=1 scst_threads=1 srptthread=0 iops=50539.96
> type=randwrite bs=4k drives=2 scst_threads=1 srptthread=0 iops=57884.80
> type=randwrite bs=4k drives=1 scst_threads=2 srptthread=0 iops=54502.85
> type=randwrite bs=4k drives=2 scst_threads=2 srptthread=0 iops=93230.44
> type=randwrite bs=4k drives=1 scst_threads=3 srptthread=0 iops=55941.89
> type=randwrite bs=4k drives=2 scst_threads=3 srptthread=0 iops=94480.92
For the srptthread=0 case there is a consistent, quite big increase.
>>> but would still give srpt thread=1 a slight performance advantage.
>> At this level the CPU caches start playing an essential role. To get
>> maximum performance, the processing of each command should use the
>> same CPU L2+ cache(s), i.e. be done on the same physical CPU, but
>> on different cores. Most likely, the affinity you assigned was worse
>> than the scheduler's decisions. What's your CPU configuration? Please
>> send me the top/vmstat output from the target during the tests, as
>> well as the target's dmesg from just after it boots.
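
(For reference, a small sketch, assuming the standard Linux sysfs topology
layout, that prints which logical CPUs sit on the same physical package, so
cooperating threads can be pinned to different cores that share a cache. The
CPU count of 8 matches the dual quad-core target described below, but is
otherwise just an assumption.)

/* Sketch: report package and core IDs for CPUs 0-7 from sysfs, so that
 * threads can be placed on different cores of the same physical CPU. */
#include <stdio.h>

int main(void)
{
    char path[128];

    for (int cpu = 0; cpu < 8; cpu++) {
        int pkg = -1, core = -1;
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/topology/physical_package_id",
                 cpu);
        if ((f = fopen(path, "r"))) { fscanf(f, "%d", &pkg); fclose(f); }

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/topology/core_id", cpu);
        if ((f = fopen(path, "r"))) { fscanf(f, "%d", &core); fclose(f); }

        printf("cpu%d: package %d, core %d\n", cpu, pkg, core);
    }
    return 0;
}
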
> My CPU config on the target (where I did the affinity) is 2 quad-core
> Xeon E5440s @ 2.83GHz. I didn't have my script configured to dump top
> and vmstat, so here's data from a rerun (and I have attached the
> requested info). I'm not sure what accounts for the spike at the
> beginning, but it seems consistent.
>
> type=randwrite bs=4k drives=1 scst_threads=1 srptthread=1 iops=104699.43
> type=randwrite bs=4k drives=2 scst_threads=1 srptthread=1 iops=133928.98
> type=randwrite bs=4k drives=1 scst_threads=2 srptthread=1 iops=82736.73
> type=randwrite bs=4k drives=2 scst_threads=2 srptthread=1 iops=82221.42
> type=randwrite bs=4k drives=1 scst_threads=3 srptthread=1 iops=70203.53
> type=randwrite bs=4k drives=2 scst_threads=3 srptthread=1 iops=85628.45
> type=randwrite bs=4k drives=1 scst_threads=1 srptthread=0 iops=75646.90
> type=randwrite bs=4k drives=2 scst_threads=1 srptthread=0 iops=87124.32
> type=randwrite bs=4k drives=1 scst_threads=2 srptthread=0 iops=74545.84
> type=randwrite bs=4k drives=2 scst_threads=2 srptthread=0 iops=88348.71
> type=randwrite bs=4k drives=1 scst_threads=3 srptthread=0 iops=71837.15
> type=randwrite bs=4k drives=2 scst_threads=3 srptthread=0 iops=84387.22
Why is there such a huge difference from the results you sent in the
previous e-mail? For instance, for the drives=1 scst_threads=1
srptthread=1 case it is 104K vs. 74K. What did you change?
What is content of /proc/interrupts after the tests?