[ofw] patch: Fix a race in the cl_timer code that caused deadlocks in opensm

Tue Jun 22 09:25:29 PDT 2010

Are you getting multiple callbacks simultaneously because:
1. Multiple threads call cl_timer_start on the same timer simultaneously?
2. The timer callback starts the timer with a short enough interval that it expires before the callback returns?

The case where the timer is started from the callback should be fine:

1. cl_timer_start called by app thread
2. timer fires, callback invoked (thread_id set)
3. app calls cl_timer_start from callback
3.1 cl_timer_start calls cl_timer_stop
3.2 cl_timer_stop checks thread ID, it matches so it does nothing.
3.3 cl_timer_start queues the timer anew

Now if the first timer callback makes any more timer calls on that timer, it needs to serialize them with any other thread that might make timer calls too (i.e. the timer could have expired again and be running the callback concurrently.)  Note that the app must handle its callback running concurrently, so the synchronization problem is likely much larger than just the timer.  Note also that multiple timers being used concurrently could be executing their callbacks in parallel, and the app code needs to handle that.  It's possible OpenSM assumes a single timer queue thread.

In any case, I'm not quite seeing the deadlock (though the timer is definitely not threadsafe, so if the problem is that you have multiple threads calling cl_timer_start concurrently they yes, you have a problem.  But the timer wasn't intended to be threadsafe - driving the state in two directions at the same time is a programming error.

That said, I think making the timer threadsafe is probably easier than fixing the calling code.  I just worry that there might be bigger problems in the calling code if it expects timer callbacks to be serialized.

-Fab

Tzachi Dar wrote on Tue, 22 Jun 2010 at 09:03:56

> Hi Sean,
> 
> The patch that I have proposed was meant to be relatively simple and
> also not to require changes to the code of the calling entities.
> The main problem with the current implementation is that it is using
> windows timers => more than one thread can be used. This is good for
> performance bad makes the code more complex. Opensm is also calling
> start_timer from the callback (I did not want to change that). In the
> current implementation the start also calls stop which is where the
> problem starts.
> 
> Moving to a single thread callback can probably make our lives simpler
> (see the Linux implementation of the cl_timer).
> 
> Again, I wanted to make the change simpler.
> 
> See also bellow.
> 
> Thanks
> Tzachi
> 
>> -----Original Message-----
>> From: Hefty, Sean [mailto:sean.hefty at intel.com]
>> Sent: Tuesday, June 22, 2010 6:45 PM
>> To: Tzachi Dar; ofw at lists.openfabrics.org
>> Cc: Yevgeny Kliteynik
>> Subject: RE: [ofw] patch: Fix a race in the cl_timer code that caused
>> deadlocks in opensm
>> 
>>> This should promise that if a callback is running, p_timer-
>> thread_id
>> will
>>> have the thread_id of the running thread.
>>> 
>>> This field is later used when cl_timer_stop() is being called in order
>>> to prevent deadlocks. (please note that cl_timer_stop() is being
> called
>> from
>>> cl_timer_start(). Please also note that the deadlock happens if the
>>> caller of cl_stop() is actually executing in the context of the call
>>> back
>> itself).
>> 
>> I was just looking at the cl_timer code yesterday for this exact same
>> reason.  Good catch.
>> 
>> Maybe this should be rethought entirely and simply not allow certain
>> timer manipulation from the callback, or require the caller to provide
>> the synchronization.  The callback can return a value to indicate if
>> the timer should stop.
>> 
>> This changes require changes to opensm, that I didn't want to do. Also,
>> are we sure that the kernel doesn't have the same problem?
> 
> I didn't look at the kernel code. It is very different in nature, and
> probably has it's limitations as well.
> 
> [Tzachi Dar]
>>> The fix itself is to sync the start and stop calls and now only one
>>> callback will be running (actually, there is one exception to this
>>> code: if from the callback, a new timer is started, the new callback
>>> might be running on the same time, but: (1) stop has already being
>>> called.
> (2)
>> the
>>> p_timer->thread_id is not being touched after the callback.
>> 
>> A single timer allows multiple, simultaneous callbacks?  If so, this
>> seems wrong.
> 
> Indeed wrong, but I believe that currently opensm can live with this
> since it is using it's own locks.
> 
>> 
>>>  static void CALLBACK
>>>  
>>>  __timer_callback(
>>>                 IN cl_timer_t* const p_timer,
>>>                 
>>>                 IN BOOLEAN timer_signalled )
>>>  {
>>>                 /* timer_signalled is always TRUE, and has no
> value.
>> */
>>> 
>>> +
>>> 
>>> +             DWORD thread_id = GetCurrentThreadId();
>>> 
>>>                 CL_ASSERT( timer_signalled );
>>> +
>>> 
>>>                 UNUSED_PARAM( timer_signalled );
>>> 
>>> 
>>> +             CL_ASSERT(thread_id!=0);
>>> 
>>> +
>>> 
>>>                 p_timer->timeout_time = 0;
>>> -              p_timer->thread_id = GetCurrentThreadId();
>>> 
>>> +             p_timer->thread_id = thread_id;
>>> 
>> 
>> In a free build, I would fully expect the compiler to optimize the
>> local variable thread_id away.
>> 
> This is true, but in a checked build you can see the history in the
> debugger (it helps to find the bug)
> _______________________________________________ ofw mailing list
> ofw at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw