[ofw] patch: Fix a race in the cl_timer code that caused deadlocks in opensm

Hefty, Sean sean.hefty at intel.com
Thu Jun 24 14:19:37 PDT 2010


> It depends on whether clients expect to have thread-safe timers or not.
> That is, do you allow multiple threads to act on a timer simultaneously?
> I'm inclined to say not, but Tzachi previously mentioned that there were
> multiple threads in OpenSM that were starting the timer.  If the only
> reason that happened is due to multiple callbacks, then we can probably get
> away without serialization, in which case my code changes also become quite
> a bit simpler.

See start below.  It doesn't matter if multiple threads call it or not.

> > timer_callback(...)
> > {
> > 	EnterCriticalSection(p_timer->lock);
> > 	p_timer->pfn_callback(..);
> > 	LeaveCriticalSection(p_timer->lock);
> 
> Didn't we try holding a lock over the callback and ended up with things not
> working?  This goes back to using the threadpool but not knowing exactly
> how it spins up threads etc.

This has to work, otherwise, the user can't even acquire a lock in their callback, which they do.  If serializing the callback doesn't work, the client must be expecting multiple callbacks for a single timer.  In that case, we just remove the locking above and we're done.

> > cl_timer_start(..)
> > {
> > 	HANDLE timer;
> > 	/* optional:
> > 	DeleteTimerQueueTimerEx(p_timer->timer_queue, NULL); */
> > 	return CreateTimerQueueTimer(&timer, p_timer->timer_queue,
> > timer_callback, ...);
> > }
> >
> > cl_timer_stop(..)
> > {
> > 	DeleteTimerQueueTimerEx(p_timer->timer_queue,
> > INVALID_HANDLE_VALUE);
> > }
> >
> > cl_timer_trim(..)
> > {
> > 	cl_timer_start(..)
> > }
> >
> > This seems MUCH simpler and serializes callbacks.  Do we really need
> > all the extra timer state checking goop just to avoid a few additional
> > callbacks?  I have a hard time believing that adding more complex
> > locking than what's already there won't introduce other, more difficult
> > to find race conditions.
> 
> Depends on whether there ends up being issues with multiple thread pool
> threads blocked on that critsec.  I thought there was an issue reported
> with this...

Someone needs to identify the deadlock, if it exists, because the deadlock is likely there in any case.  We're trying to get waaaaaaaay to fancy with this.



More information about the ofw mailing list