[ofw] patch: Fix a race in the cl_timer code that caused deadlocks in opensm
Hefty, Sean
sean.hefty at intel.com
Thu Jun 24 10:28:39 PDT 2010
Stan, can you repost just the user space changes to cl_timer?
> > If thread_id is set to zero under the cb_serialize lock, immediately
> > after the return from the timer callback pfn_callback(), when compute
> > nodes reboot they do not get IPv4 addresses by DHCP?
> > Remove just the p_timer->thread_id = 0 after the callback and DHCP
> > starts assigning addresses upon reboot? Who would have thought?
>
> This is only the user-mode changes? If you back those changes out on only
> the node running OpenSM, do things work? That is, is it a problem with
> OpenSM not bringing things up properly?
I gave this more thought, and I'm convinced that we need thread_id set as this:
lock cb_serialize
thread_id = GetID()
callback()
thread_id = 0
unlock cb_serialize
if we want cl_timer_stop to block. Without this, the checks in cl_timer_stop/start won't work correctly. Personally, I don't see any reason to keep the check in cl_timer_stop to see if we're calling from the callback. Nothing that I see actually does this anyway, and it's just adding more complexity to the code.
> > Also, DAPL tests hang when cl_timer callbacks are serialized. I suspect
> > there has been a long standing bug that was never noticed.....sigh.
>
> Nothing wrong with uncovering bugs. I believe that's called progress. :)
>
> > Perhaps opensm should have it's own user-mode implementation of cl_timer
> ?
I thought about this, and it may be necessary if we need to serialize all timers. But I still believe that the other cl_timer implementation needs to be fixed.
> I think we all agree that the multiple callback model for timers is broken
> - each timer should have at most one callback executing at a time. So we
> need to change cl_timer both in user and kernel to at least fix that. If
> other code ends up broken, that's a side effect.
The calling code should definitely not rely on a single timer calling them back multiple times.
> One thing to keep in mind is that with the lock to serialize the callback
> threads, you may block threadpool threads that need to run something else,
> so the deadlock might be a side effect of serializing the callbacks with a
> lock.
What thread pool are you referring to?
I believe all users of cl_timer acquire a lock to serialize access to their own lists. In theory, all we've done is move the locking down into the cl_timer abstraction.
- Sean
More information about the ofw
mailing list