[ofa-general] OFED1.3.1: soft lockup in completion handler

Liang Zhen Zhen.Liang at Sun.COM
Thu Apr 23 08:13:51 PDT 2009


Eli Cohen wrote:
> Where is the other place that you acquire conn->ibc_sched->ibs_lock?
> Is it in the per-CPU thread? Maybe you should try to decrease the time
> the lock is held in the thread. Can you send all references
> to the code acquiring the lock?
>   
Eli,

Yes, it's a per-CPU lock (for the CPU-affinity thread), so the completion 
callback can only race with one thread at a time, and I'm quite sure the 
thread does only very light operations while holding the lock.
I actually already know the reason: most of the time, interrupts tend 
to land on the same CPU, so the soft lockup watchdog thread never gets a 
chance to run there if the interrupt rate is very high, e.g. 100K/sec. 
We do have irqbalance, but it only wakes up every 10 seconds (it's a 
pity that we can't change the interval; it's a hard-coded constant), 
which is long enough to trigger the soft lockup warning.
So I added a static counter to the completion handler, and when we find 
that there have been too many interrupts for several seconds, we just 
call touch_softlockup_watchdog() to tell the watchdog that the CPU is OK 
and not soft locked up. Also, we reserve the first core on each CPU 
socket (no affinity thread is bound to it), to make sure there are 
enough cores to handle interrupts.
I know it's ugly, but it works, and it's the only way I can find to 
resolve my problem if there are no multiple completion vectors.
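
The counter idea above can be sketched roughly as follows. This is a 
userspace simulation, not the actual kernel code: the thresholds, the 
struct cq_load type, and the function names are all hypothetical, and 
touch_softlockup_watchdog() is replaced by a stub that only counts calls.

```c
/* Hypothetical thresholds; the message above gives no exact values. */
#define BUSY_EVENTS_PER_SEC 50000UL /* "too many interrupts" in one second */
#define BUSY_SECONDS        3UL     /* "several seconds" of sustained load */

/* Counts stub invocations so the simulation can be checked. */
static unsigned long touch_count;

/* Userspace stand-in for the kernel's touch_softlockup_watchdog(). */
static void touch_softlockup_watchdog_stub(void)
{
    touch_count++;
}

/* Load accounting state (a real handler would keep this per CPU). */
struct cq_load {
    unsigned long window_start; /* second in which the current window began */
    unsigned long events;       /* completions seen in the current window */
    unsigned long busy_secs;    /* consecutive busy one-second windows */
};

/* Called once per completion event; now_sec is the current time in seconds. */
static void completion_handler_account(struct cq_load *l, unsigned long now_sec)
{
    if (now_sec != l->window_start) {
        /* The previous window is over: was it busy and directly adjacent? */
        if (now_sec == l->window_start + 1 &&
            l->events >= BUSY_EVENTS_PER_SEC)
            l->busy_secs++;
        else
            l->busy_secs = 0;

        l->window_start = now_sec;
        l->events = 0;

        /* Sustained interrupt load: tell the watchdog the CPU is alive. */
        if (l->busy_secs >= BUSY_SECONDS) {
            touch_softlockup_watchdog_stub();
            l->busy_secs = 0;
        }
    }
    l->events++;
}

/* Feed `rate` events per second for `seconds` seconds; return stub calls. */
static unsigned long simulate(unsigned long rate, unsigned long seconds)
{
    struct cq_load l = { 0, 0, 0 };
    unsigned long s, i;

    touch_count = 0;
    for (s = 0; s < seconds; s++)
        for (i = 0; i < rate; i++)
            completion_handler_account(&l, s);
    return touch_count;
}
```

With these assumed thresholds, a steady 100K events/sec triggers the stub 
once every three busy seconds, while a light load of 1K events/sec never 
triggers it.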

Anyway, I really think multiple completion vectors will be an important 
feature in the near future, because our hardware is getting faster and 
faster, and machines have more and more CPU cores.
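
For illustration only, one simple way to spread CQs across completion 
vectors is to derive the vector from the CPU that the scheduler thread 
is bound to. The helper name pick_comp_vector() is hypothetical, and the 
vector count is assumed to come from the device (for example a field 
such as num_comp_vectors); the result would be passed as the comp_vector 
argument of ib_create_cq().

```c
/*
 * Hypothetical helper: choose a completion vector for the scheduler
 * thread bound to `cpu`. Drivers that expose only one vector (or report
 * none) always get vector 0, which matches the single-vector behaviour.
 */
static unsigned int pick_comp_vector(unsigned int cpu,
                                     unsigned int num_comp_vectors)
{
    if (num_comp_vectors <= 1)
        return 0;
    /* Round-robin mapping of CPUs onto the available vectors. */
    return cpu % num_comp_vectors;
}
```

On a driver with a single completion vector this degrades to the old 
behaviour, so the same code path can be used with every driver.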

Thanks
Liang

>
>   
>> I've tried turning off irqbalance and setting /proc/irq/.../smp_affinity
>> to more cores, but it changed nothing and I still get the soft lockup.
>>
>> After I installed ofed1.4.1 and created the CQ with
>> ib_create_cq(....comp_vector), the problem was gone and I got really good
>> performance. The problem now is that ofed1.4.1:mlx4 seems to be the only
>> driver that can really support multiple completion vectors, but we can't
>> expect all customers to have the same environment...
>> Is there any other possible way to resolve this?
>>
>> Thanks
>> Liang