[ofa-general] SRP/mlx4 interrupts throttling performance
Cameron Harr
cameron at harr.org
Tue Oct 7 08:44:44 PDT 2008
Vladislav Bolkhovitin wrote:
> Cameron Harr wrote:
>> Cameron Harr wrote:
>>>> This is still too high. Considering that each CS is about 1
>>>> microsecond, you can estimate how many IOPS it costs you.
>>> Dropping scst_threads down to 2, from 8, with 2 initiators, seems to
>>> make a fairly significant difference, propelling me to a little over
>>> 100K IOPs and putting the CS rate around 2:1, sometimes lower. 2
>>> threads gave the best performance compared to 1, 4 and 8.
>>
>> Just as a status update, I've gotten my best performance with
>> scst_threads=3 on 2 initiators, and using a separate QP for each
>> drive an initiator is writing to. I'm getting pretty consistent
>> 112-115K IOPs using two initiators, each writing with 2 processes to
>> the same 2 physical targets, using 512B blocks. Adding the second
>> initiator only bumps me up by about 20K IOPs, but as all the CPUs are
>> pegged around 99%, I'll take that as the bottleneck. Also, following up
>> on Vlad's advice, the CS rate is now around 70K/s at 115K IOPs, so
>> it's not too bad. Interrupts (where this thread started) are around
>> 200K/s - a lot higher than I thought they'd go, but I'm not
>> complaining. :)
>
> Actually, what you did is tune your workload so that it maps nicely onto
> all the participating threads and CPU cores: each thread stays on its
> own CPU core and gracefully passes commands to the others during
> processing, staying busy almost all the time. I.e., you put your system
> into some kind of resonance. If you change your workload just a bit, or
> the Linux scheduler changes in the next kernel version, your tuning
> would be destroyed.
>
This "resonance" thought actually crossed my mind. I later went and ran
the test locally and found that I got better performance via SRP than I
did locally (good marketing for you :) ). The local run, using no
networking, gave me around 2 CS/IO. It appeared that when I added the
second initiator, the requests from the 2 initiators for a single target
would get coalesced, which would improve the performance.
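For what it's worth, here's the back-of-the-envelope arithmetic behind
Vlad's "each CS is about 1 microsecond" rule of thumb, using the rates
quoted above. The 1 us cost per switch and the core count are assumptions
on my part rather than measurements on this hardware, so treat the
numbers as rough illustrations only:

CS_COST_US = 1.0   # assumed cost of one context switch, in microseconds
CORES = 4          # assumed number of cores on the target (not stated in this thread)

def switch_overhead(iops, cs_per_sec):
    """Return (us of pure switching per I/O, fraction of total CPU spent switching)."""
    per_io_us = cs_per_sec * CS_COST_US / iops
    cpu_fraction = cs_per_sec * CS_COST_US / (CORES * 1_000_000)
    return per_io_us, cpu_fraction

# Tuned SRP case above: ~115K IOPs with ~70K context switches per second.
print(switch_overhead(115_000, 70_000))    # ~0.6 us per I/O, ~1.8% of CPU

# Earlier ~2:1 CS-to-I/O ratio at a little over 100K IOPs, i.e. ~200K CS/s.
print(switch_overhead(100_000, 200_000))   # ~2 us per I/O, ~5% of CPU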
> So, I wouldn't overestimate your results. As I already wrote, the only
> real fix is to remove all the unneeded context switches between
> threads during command processing. This fix would work not only on
> carefully tuned artificial workloads, but on real-life ones too. Having
> 5-10 threads participate in processing a single command reminds me of
> the famous set of jokes about how many people of some kind it takes to
> change a burnt-out light bulb ;)
Nice analogy :). I wish I knew how to eradicate the extra context
switches. I'll try Bart's trick and see if I can get more info.
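In the meantime, here's a quick sketch of one way to watch where the
switches are going (a generic approach, not necessarily what Bart had in
mind): sample the voluntary/involuntary context-switch counters the
kernel exports in /proc/<pid>/status for the thread PIDs you care about
and print the per-second deltas.

#!/usr/bin/env python
# Watch context-switch counters for a set of PIDs via /proc/<pid>/status.
# Usage: python cswatch.py <pid> [<pid> ...]

import sys, time

def read_counts(pid):
    # Parse the voluntary/nonvoluntary context-switch lines for one PID.
    vol = nonvol = 0
    with open("/proc/%s/status" % pid) as f:
        for line in f:
            if line.startswith("voluntary_ctxt_switches:"):
                vol = int(line.split()[1])
            elif line.startswith("nonvoluntary_ctxt_switches:"):
                nonvol = int(line.split()[1])
    return vol, nonvol

def main():
    pids = sys.argv[1:]
    prev = dict((pid, read_counts(pid)) for pid in pids)
    while True:
        time.sleep(1)
        for pid in pids:
            vol, nonvol = read_counts(pid)
            pvol, pnonvol = prev[pid]
            print("pid %s: +%d voluntary, +%d involuntary CS/s"
                  % (pid, vol - pvol, nonvol - pnonvol))
            prev[pid] = (vol, nonvol)
        print("---")

if __name__ == "__main__":
    main()

Pointed at the SCST thread PIDs on the target, this should at least show
which threads are doing most of the switching.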