[ofw] patch: Fix a race in the cl_timer code that caused deadlocks in opensm

Fab Tillier ftillier at microsoft.com
Thu Jun 24 09:21:49 PDT 2010


Smith, Stan wrote on Wed, 23 Jun 2010 at 22:28:01

> Hefty, Sean wrote:
>>> Forgot to mention, you should use the cb_serialize lock to protect
>>> setting the thread ID too.
>> 
>> The thread ID needs to be protected using 'spinlock', since that lock
>> is held when reading it later.  Or maybe cb_serialize would work, as
>> long as thread_id is cleared after the callback returns...
>> 
>> Actually, I think setting thread_id = 0 is required, since its
>> purpose is to see if timer_stop is being called from the callback.
> 
> Here's one for the I wonder why crowd....
> 
> If thread_id is set to zero under the cb_serialize lock, immediately
> after the return from the timer callback pfn_callback(), when compute
> nodes reboot they do not get IPv4 addresses by DHCP?
> Remove just the p_timer->thread_id = 0 after the callback and DHCP
> starts assigning addresses upon reboot? Who would have thought?

This is only the user-mode changes?  If you back those changes out on only the node running OpenSM, do things work?  That is, is it a problem with OpenSM not bringing things up properly?

> Also, DAPL tests hang when cl_timer callbacks are serialized. I suspect
> there has been a long standing bug that was never noticed.....sigh.

Nothing wrong with uncovering bugs.  I believe that's called progress. :)
 
> Perhaps opensm should have it's own user-mode implementation of cl_timer ?

I think we all agree that the multiple callback model for timers is broken - each timer should have at most one callback executing at a time.  So we need to change cl_timer both in user and kernel to at least fix that.  If other code ends up broken, that's a side effect.

One thing to keep in mind is that with the lock to serialize the callback threads, you may block threadpool threads that need to run something else, so the deadlock might be a side effect of serializing the callbacks with a lock.

-Fab




More information about the ofw mailing list