[ofa-general] SRP/mlx4 interrupts throttling performance
Cameron Harr
cameron at harr.org
Tue Oct 7 08:44:44 PDT 2008
Vladislav Bolkhovitin wrote:
> Cameron Harr wrote:
>> Cameron Harr wrote:
>>>> This is still too high. Considering that each CS is about 1
>>>> microsecond, you can estimate how many IOPS it costs you.
>>> Dropping scst_threads down to 2, from 8, with 2 initiators, seems to
>>> make a fairly significant difference, propelling me to a little over
>>> 100K IOPs and putting the CS rate around 2:1, sometimes lower. 2
>>> threads gave the best performance compared to 1, 4 and 8.
>>
>> Just as a status update, I've gotten my best performance with
>> scst_threads=3 on 2 initiators, and using a separate QP for each
>> drive an initiator is writing to. I'm getting pretty consistent
>> 112-115K IOPs using two initiators, each writing with 2 processes to
>> the same 2 physical targets, using 512B blocks. Adding the second
>> initiator only bumps me up by about 20K IOPs, but as all the CPUs are
>> pegged around 99%, I'll take that as the bottleneck. Also, following up
>> on Vlad's advice, the CS rate is now around 70K/s at 115K IOPs, so
>> it's not too bad. Interrupts (where this thread started) are around
>> 200K/s - a lot higher than I thought they'd go, but I'm not
>> complaining. :)
>
> Actually, what you did is tune your workload so that it maps nicely onto
> all the participating threads and CPU cores: each thread stays on its
> own CPU core and gracefully passes commands to the others during
> processing, staying busy almost all the time. I.e., you put your system
> into some kind of resonance. If you change your workload just a bit, or
> the Linux scheduler changes in the next kernel version, your tuning
> would be destroyed.
>
This "resonance" thought actually crossed my mind. I later went and ran
the test locally and found that I got better performance via SRP than I
did locally (good marketing for you :) ). The local run, using no
networking, gave me around 2 CS/IO. It appeared that when I added the
second initiator, the requests from the 2 initiators for a single target
would get coalesced, which would improve the performance.
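For what it's worth, here's the back-of-the-envelope arithmetic behind
Vlad's "each CS is about 1 microsecond" rule of thumb, using the rates
quoted above. The 1 us cost per switch and the core count are assumptions
on my part rather than measurements on this hardware, so treat the
numbers as rough illustrations only:

CS_COST_US = 1.0   # assumed cost of one context switch, in microseconds
CORES = 4          # assumed number of cores on the target (not stated in this thread)

def switch_overhead(iops, cs_per_sec):
    """Return (us of pure switching per I/O, fraction of total CPU spent switching)."""
    per_io_us = cs_per_sec * CS_COST_US / iops
    cpu_fraction = cs_per_sec * CS_COST_US / (CORES * 1_000_000)
    return per_io_us, cpu_fraction

# Tuned SRP case above: ~115K IOPs with ~70K context switches per second.
print(switch_overhead(115_000, 70_000))    # ~0.6 us per I/O, ~1.8% of CPU

# Earlier ~2:1 CS-to-I/O ratio at a little over 100K IOPs, i.e. ~200K CS/s.
print(switch_overhead(100_000, 200_000))   # ~2 us per I/O, ~5% of CPU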
> So, I wouldn't overestimate your results. As I already wrote, the only
> real fix is to remove all the unneeded context switches between
> threads during command processing. This fix would work not only on
> carefully tuned artificial workloads, but on real-life ones too. Having
> 5-10 threads participate in processing a single command reminds me of
> the famous set of jokes about how many people of some kind it takes to
> change a burnt-out light bulb ;)
Nice analogy :). I wish I knew how to eradicate the extra context
switches. I'll try Bart's trick and see if I can get more info.
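In the meantime, here's a quick sketch of one way to watch where the
switches are going (a generic approach, not necessarily what Bart had in
mind): sample the voluntary/involuntary context-switch counters the
kernel exports in /proc/<pid>/status for the thread PIDs you care about
and print the per-second deltas.

#!/usr/bin/env python
# Watch context-switch counters for a set of PIDs via /proc/<pid>/status.
# Usage: python cswatch.py <pid> [<pid> ...]

import sys, time

def read_counts(pid):
    # Parse the voluntary/nonvoluntary context-switch lines for one PID.
    vol = nonvol = 0
    with open("/proc/%s/status" % pid) as f:
        for line in f:
            if line.startswith("voluntary_ctxt_switches:"):
                vol = int(line.split()[1])
            elif line.startswith("nonvoluntary_ctxt_switches:"):
                nonvol = int(line.split()[1])
    return vol, nonvol

def main():
    pids = sys.argv[1:]
    prev = dict((pid, read_counts(pid)) for pid in pids)
    while True:
        time.sleep(1)
        for pid in pids:
            vol, nonvol = read_counts(pid)
            pvol, pnonvol = prev[pid]
            print("pid %s: +%d voluntary, +%d involuntary CS/s"
                  % (pid, vol - pvol, nonvol - pnonvol))
            prev[pid] = (vol, nonvol)
        print("---")

if __name__ == "__main__":
    main()

Pointed at the SCST thread PIDs on the target, this should at least show
which threads are doing most of the switching.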