[ofw] patch: Fix a race in the cl_timer code that caused deadlocks in opensm

Smith, Stan stan.smith at intel.com
Wed Jun 23 22:28:01 PDT 2010


Hefty, Sean wrote:
>> Forgot to mention, you should use the cb_serialize lock to protect
>> setting the thread ID too.
>
> The thread ID needs to be protected using 'spinlock', since that lock
> is held when reading it later.  Or maybe cb_serialize would work, as
> long as thread_id is cleared after the callback returns...
>
> Actually, I think setting thread_id = 0 is required, since its
> purpose is to see if timer_stop is being called from the callback.

Here's one for the "I wonder why" crowd...

If thread_id is set to zero under the cb_serialize lock immediately after the timer callback pfn_callback() returns, compute nodes do not get IPv4 addresses via DHCP when they reboot.
Remove just the p_timer->thread_id = 0 after the callback, and DHCP starts assigning addresses again upon reboot. Who would have thought?
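
For anyone following along, here is a minimal sketch of the pattern under discussion, written with portable pthreads rather than taken from the actual complib source. Every name prefixed sketch_ or demo_ is hypothetical, and the in_callback flag is an assumption standing in for the thread_id = 0 sentinel, since a pthread_t cannot be portably compared to zero. It only illustrates why the stop path needs to know whether it is running inside the callback; it does not reproduce the DHCP behavior described above.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for a timer object; NOT the real cl_timer_t layout. */
typedef struct sketch_timer {
    pthread_mutex_t spinlock;      /* guards thread_id and in_callback */
    pthread_mutex_t cb_serialize;  /* held for the duration of the callback */
    pthread_t       thread_id;     /* meaningful only while in_callback != 0 */
    int             in_callback;   /* plays the role of thread_id != 0 */
    void          (*pfn_callback)(void *context);
    void           *context;
} sketch_timer_t;

static void sketch_timer_init(sketch_timer_t *t,
                              void (*cb)(void *), void *context)
{
    int rc = pthread_mutex_init(&t->spinlock, NULL);
    assert(rc == 0);
    rc = pthread_mutex_init(&t->cb_serialize, NULL);
    assert(rc == 0);
    t->thread_id = pthread_self();
    t->in_callback = 0;
    t->pfn_callback = cb;
    t->context = context;
}

/* Returns 1 if called from inside this timer's own callback, else 0. */
static int sketch_timer_stop(sketch_timer_t *t)
{
    int from_callback;

    /* Read the owner under the same lock that protects the write. */
    pthread_mutex_lock(&t->spinlock);
    from_callback = t->in_callback &&
                    pthread_equal(t->thread_id, pthread_self());
    pthread_mutex_unlock(&t->spinlock);

    if (from_callback)
        return 1;  /* taking cb_serialize here would self-deadlock */

    /* Otherwise wait for any callback in flight on another thread. */
    pthread_mutex_lock(&t->cb_serialize);
    pthread_mutex_unlock(&t->cb_serialize);
    return 0;
}

/* What the timer thread would do when the timer expires. */
static void sketch_timer_fire(sketch_timer_t *t)
{
    pthread_mutex_lock(&t->cb_serialize);

    pthread_mutex_lock(&t->spinlock);
    t->thread_id = pthread_self();
    t->in_callback = 1;
    pthread_mutex_unlock(&t->spinlock);

    t->pfn_callback(t->context);

    /* Analogous to p_timer->thread_id = 0 after the callback returns. */
    pthread_mutex_lock(&t->spinlock);
    t->in_callback = 0;
    pthread_mutex_unlock(&t->spinlock);

    pthread_mutex_unlock(&t->cb_serialize);
}

/* Hypothetical demo state: the callback stops its own timer. */
typedef struct demo_state {
    sketch_timer_t timer;
    int            stop_result;
} demo_state_t;

static void demo_callback(void *context)
{
    demo_state_t *s = context;
    s->stop_result = sketch_timer_stop(&s->timer);
}

/* Calls stop once from inside the callback and once from outside. */
static void sketch_demo(int *inside, int *outside)
{
    demo_state_t s;

    sketch_timer_init(&s.timer, demo_callback, &s);
    s.stop_result = -1;

    sketch_timer_fire(&s.timer);
    *inside = s.stop_result;
    *outside = sketch_timer_stop(&s.timer);

    pthread_mutex_destroy(&s.timer.spinlock);
    pthread_mutex_destroy(&s.timer.cb_serialize);
}
```

In this sketch, if the marker were never cleared after the callback, a later stop call from the same thread would wrongly conclude it was still inside the callback and skip the wait.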

Also, DAPL tests hang when cl_timer callbacks are serialized.
I suspect there has been a long-standing bug that was never noticed... sigh.

Perhaps opensm should have its own user-mode implementation of cl_timer?


