[ofw] patch: Fix a race in the cl_timer code that caused deadlocks in opensm

Thu Jun 24 14:19:37 PDT 2010

> It depends on whether clients expect to have thread-safe timers or not.
> That is, do you allow multiple threads to act on a timer simultaneously?
> I'm inclined to say not, but Tzachi previously mentioned that there were
> multiple threads in OpenSM that were starting the timer.  If the only
> reason that happened is due to multiple callbacks, then we can probably get
> away without serialization, in which case my code changes also become quite
> a bit simpler.

See start below.  It doesn't matter if multiple threads call it or not.

> > timer_callback(...)
> > {
> > 	EnterCriticalSection(p_timer->lock);
> > 	p_timer->pfn_callback(..);
> > 	LeaveCriticalSection(p_timer->lock);
> 
> Didn't we try holding a lock over the callback and ended up with things not
> working?  This goes back to using the threadpool but not knowing exactly
> how it spins up threads etc.

This has to work, otherwise, the user can't even acquire a lock in their callback, which they do.  If serializing the callback doesn't work, the client must be expecting multiple callbacks for a single timer.  In that case, we just remove the locking above and we're done.

> > cl_timer_start(..)
> > {
> > 	HANDLE timer;
> > 	/* optional:
> > 	DeleteTimerQueueTimerEx(p_timer->timer_queue, NULL); */
> > 	return CreateTimerQueueTimer(&timer, p_timer->timer_queue,
> > timer_callback, ...);
> > }
> >
> > cl_timer_stop(..)
> > {
> > 	DeleteTimerQueueTimerEx(p_timer->timer_queue,
> > INVALID_HANDLE_VALUE);
> > }
> >
> > cl_timer_trim(..)
> > {
> > 	cl_timer_start(..)
> > }
> >
> > This seems MUCH simpler and serializes callbacks.  Do we really need
> > all the extra timer state checking goop just to avoid a few additional
> > callbacks?  I have a hard time believing that adding more complex
> > locking than what's already there won't introduce other, more difficult
> > to find race conditions.
> 
> Depends on whether there ends up being issues with multiple thread pool
> threads blocked on that critsec.  I thought there was an issue reported
> with this...

Someone needs to identify the deadlock, if it exists, because the deadlock is likely there in any case.  We're trying to get waaaaaaaay to fancy with this.