[ofw] patch: Fix a race in the cl_timer code that caused deadlocks in opensm

Thu Jun 24 10:28:39 PDT 2010

Stan, can you repost just the user space changes to cl_timer?

> > If thread_id is set to zero under the cb_serialize lock, immediately
> > after the return from the timer callback pfn_callback(), when compute
> > nodes reboot they do not get IPv4 addresses by DHCP?
> > Remove just the p_timer->thread_id = 0 after the callback and DHCP
> > starts assigning addresses upon reboot? Who would have thought?
> 
> This is only the user-mode changes?  If you back those changes out on only
> the node running OpenSM, do things work?  That is, is it a problem with
> OpenSM not bringing things up properly?

I gave this more thought, and I'm convinced that we need thread_id set as this:

lock cb_serialize
thread_id = GetID()
callback()
thread_id = 0
unlock cb_serialize

if we want cl_timer_stop to block.  Without this, the checks in cl_timer_stop/start won't work correctly.  Personally, I don't see any reason to keep the check in cl_timer_stop to see if we're calling from the callback.  Nothing that I see actually does this anyway, and it's just adding more complexity to the code.

> > Also, DAPL tests hang when cl_timer callbacks are serialized. I suspect
> > there has been a long standing bug that was never noticed.....sigh.
> 
> Nothing wrong with uncovering bugs.  I believe that's called progress. :)
> 
> > Perhaps opensm should have it's own user-mode implementation of cl_timer
> ?

I thought about this, and it may be necessary if we need to serialize all timers.  But I still believe that the other cl_timer implementation needs to be fixed. 

> I think we all agree that the multiple callback model for timers is broken
> - each timer should have at most one callback executing at a time.  So we
> need to change cl_timer both in user and kernel to at least fix that.  If
> other code ends up broken, that's a side effect.

The calling code should definitely not rely on a single timer calling them back multiple times.

> One thing to keep in mind is that with the lock to serialize the callback
> threads, you may block threadpool threads that need to run something else,
> so the deadlock might be a side effect of serializing the callbacks with a
> lock.

What thread pool are you referring to?

I believe all users of cl_timer acquire a lock to serialize access to their own lists.  In theory, all we've done is move the locking down into the cl_timer abstraction.

- Sean