[ofw] patch: Fix a race in the cl_timer code that caused deadlocks in opensm

Tue Jun 22 09:03:56 PDT 2010

Hi Sean,

The patch that I have proposed was meant to be relatively simple and also not to require changes to the code of the calling entities.
The main problem with the current implementation is that it is using windows timers => more than one thread can be used. This is good for performance bad makes the code more complex. Opensm is also calling start_timer from the callback (I did not want to change that). In the current implementation the start also calls stop which is where the problem starts.

Moving to a single thread callback can probably make our lives simpler (see the Linux implementation of the cl_timer).

Again, I wanted to make the change simpler.

See also bellow.

Thanks
Tzachi

> -----Original Message-----
> From: Hefty, Sean [mailto:sean.hefty at intel.com]
> Sent: Tuesday, June 22, 2010 6:45 PM
> To: Tzachi Dar; ofw at lists.openfabrics.org
> Cc: Yevgeny Kliteynik
> Subject: RE: [ofw] patch: Fix a race in the cl_timer code that caused
> deadlocks in opensm
> 
> > This should promise that if a callback is running, p_timer->thread_id
> will
> > have the thread_id of the running thread.
> >
> > This field is later used when cl_timer_stop() is being called in
> order to
> > prevent deadlocks. (please note that cl_timer_stop() is being called
> from
> > cl_timer_start(). Please also note that the deadlock happens if the
> caller
> > of cl_stop() is actually executing in the context of the call back
> itself).
> 
> I was just looking at the cl_timer code yesterday for this exact same
> reason.  Good catch.
> 
> Maybe this should be rethought entirely and simply not allow certain
> timer manipulation from the callback, or require the caller to provide
> the synchronization.  The callback can return a value to indicate if
> the timer should stop.
> 
This changes require changes to opensm, that I didn't want to do.
> Also, are we sure that the kernel doesn't have the same problem?

I didn't look at the kernel code. It is very different in nature, and probably has it's limitations as well.

[Tzachi Dar] 
> > The fix itself is to sync the start and stop calls and now only one
> > callback will be running (actually, there is one exception to this
> code: if
> > from the callback, a new timer is started, the new callback might be
> > running on the same time, but: (1) stop has already being called. (2)
> the
> > p_timer->thread_id is not being touched after the callback.
> 
> A single timer allows multiple, simultaneous callbacks?  If so, this
> seems wrong.

Indeed wrong, but I believe that currently opensm can live with this since it is using it's own locks. 

> 
> >  static void CALLBACK
> >
> >  __timer_callback(
> >
> >                 IN cl_timer_t* const p_timer,
> >
> >                 IN BOOLEAN timer_signalled )
> >
> >  {
> >
> >                 /* timer_signalled is always TRUE, and has no value.
> */
> >
> > +
> >
> > +             DWORD thread_id = GetCurrentThreadId();
> >
> >                 CL_ASSERT( timer_signalled );
> >
> > +
> >
> >                 UNUSED_PARAM( timer_signalled );
> >
> >
> >
> > +             CL_ASSERT(thread_id!=0);
> >
> > +
> >
> >                 p_timer->timeout_time = 0;
> >
> > -              p_timer->thread_id = GetCurrentThreadId();
> >
> > +             p_timer->thread_id = thread_id;
> >
> 
> In a free build, I would fully expect the compiler to optimize the
> local variable thread_id away.
> 
This is true, but in a checked build you can see the history in the debugger (it helps to find the bug)