[ofw] patch: Fix a race in the cl_timer code that caused deadlocks in opensm

Smith, Stan stan.smith at intel.com
Thu Jun 24 14:29:58 PDT 2010


Hefty, Sean wrote:
>> The above code sequence in user mode cl_timer.c fails DHCP address
>> assignment upon compute node reboot.
>> HCA ports are 'ACTIVE' but no DHCP assignment. Kernel cl_timer V2
>> patches installed.
>>
>> If you go back to Tzachi's patch, then you get DHCP address
>> assignment correctly.
>>
>> thread_id = GetThreadId();
>> lock cb_serialize
>> callback()
>> unlock cb_serialize
>>
>> Currently building/testing without the lock/unlock cb_serialize.
>>
>> Will also test with
>>
>> thread_id = GetThreadId();
>> lock cb_serialize
>> callback()
>> thread_id = 0
>> unlock cb_serialize
>>
>> stay tuned.
>
> We are likely hitting another issue here.  If thread_id is not reset
> to 0 and not set under the cb_serialize lock, then the check in
> cl_timer_stop will not work reliably.  Moving code around until some
> test case passes isn't the approach we should be using.  Both code
> segments above are racy.  We're dealing with some race conditions
> that aren't going to be easy to reproduce.

I'm performing experiments to find salient points of interest, not looking for a solution by moving code around.
Is it GetThreadId() inside of the lock?
Is it the thread_id = 0?
What's magic?
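
To make the two questions above concrete, here is a minimal sketch of the second variant (thread id set and cleared under cb_serialize) together with the kind of check cl_timer_stop relies on. This is not the actual cl_timer.c code. All sketch_* names are hypothetical, pthreads and C11 atomics stand in for the Windows primitives, and the address of a thread-local variable stands in for the thread id.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the timer object; not the real cl_timer_t. */
typedef struct sketch_timer {
    pthread_mutex_t cb_serialize;   /* held for the whole callback */
    _Atomic(void *) cb_thread;      /* NULL when no callback is running */
    void (*callback)(struct sketch_timer *);
    int fired;                      /* number of completed callbacks */
} sketch_timer_t;

/* The address of a thread-local object is unique per live thread,
 * so it serves as a portable thread tag in this sketch. */
static _Thread_local char self_tag;

void *sketch_current_thread(void)
{
    return &self_tag;
}

void sketch_timer_init(sketch_timer_t *t, void (*cb)(sketch_timer_t *))
{
    pthread_mutex_init(&t->cb_serialize, NULL);
    atomic_init(&t->cb_thread, (void *)NULL);
    t->callback = cb;
    t->fired = 0;
}

/* Timer-thread side: the tag is set and cleared while the lock is held. */
void sketch_timer_fire(sketch_timer_t *t)
{
    pthread_mutex_lock(&t->cb_serialize);
    atomic_store(&t->cb_thread, sketch_current_thread());
    t->callback(t);
    t->fired++;
    atomic_store(&t->cb_thread, (void *)NULL);  /* the "thread_id = 0" step */
    pthread_mutex_unlock(&t->cb_serialize);
}

/* Returns false when called from inside the callback (waiting would
 * self-deadlock on cb_serialize); otherwise waits for any running
 * callback to finish and returns true. */
bool sketch_timer_stop(sketch_timer_t *t)
{
    if (atomic_load(&t->cb_thread) == sketch_current_thread())
        return false;
    pthread_mutex_lock(&t->cb_serialize);
    pthread_mutex_unlock(&t->cb_serialize);
    return true;
}

/* Example callback that stops its own timer and records the result. */
bool stop_result_in_cb = true;

void stop_from_callback(sketch_timer_t *t)
{
    stop_result_in_cb = sketch_timer_stop(t);
}
```

In this sketch, if the tag were not cleared after the callback, a later stop call from the same thread would return without waiting, even though it is no longer inside the callback. That is one way the check can become unreliable; whether it is what we are actually hitting is still an open question.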

>
> Tzachi successfully identified races in cl_timer.  We need to fix
> those, and if the fallout is that other bugs are more easily exposed,
> with consistent failures, then that's a good thing.
>
> - Sean



