[ofw] patch: Fix a race in the cl_timer code that caused deadlocks in opensm

Thu Jun 24 15:46:05 PDT 2010

Tzachi Dar wrote on Thu, 24 Jun 2010 at 15:24:49

> 2 quick comments:
> I apologize for not reading thoroughly all the previous mails...
> 
> 1) On opensm, there is a call to timer start from the timer callback as
> well as from the timer code => multiple calls to timer start.

Why does opensm start the same timer from two threads?  Once the timer is started for the first time, shouldn't it just get restarted from (at the end of) the timer callback?  Having two places trying to start the timer seems like a buggy design.

> 2) It seems to me that the implementation that we are using is very
> complex since callbacks can be called by multiple threads. If we limit
> the callbacks to a single thread, life is much easier, and as far as
> opensm is interested it has not performance penalty, since opensm is
> using one lock for working with the timer.

Yes, we want a single callback per timer.  If there are multiple timers, it shouldn't matter if their respective callbacks are executed on different threads in parallel.  Code that requires serialization between different timer's callbacks should use a single timer.

> So, I believe that creating the code from scratch (with our thread)
> will make our lives much simpler.

What call do you create the thread in?  cl_timer_init?  Do you create a thread per timer?  How do you clean up the threads?  Do you keep a global count of timers and destroy the thread when you destroy the last timer?  Have you ever looked at how many threads there are in an opensm process?  I just checked and it was 26.  Is adding more threads really the answer, or should we maybe evaluate the threading design a bit better?

In any case, before we code anything, we need to have clear requirements for the timer.  If the timer in complib isn't suitable for opensm, maybe opensm should have its own timer implementation rather than using complib's.  It all depends on how many other users of complib timers there are, and what their expectations are.  I'll say it again, because I think it's worth repeating: if an app serializes multiple timer callbacks itself, it really shouldn't be using multiple timers.  This assumes a proper timer implementation that runs at most a single callback at a time per timer, of course.

-Fab