[ofw] [PATCH] complib/user: fix timer race conditions

Mon Jun 28 00:28:03 PDT 2010

Hi Sean,

It seems that I have kind of missed you with this last mail, so I would like to repeat my points to be more clear.

1) Opensm calls start many times simultaneously, this is where the problem started, and we need to address that (or change opensm).
2) Even if I call start only one time there are still a few problems that I need to deal with.
Let's look at the following simple pasudo code:

TimerInit();
TimerStart(100ms);

TimerStop();

TimerDelete();

With this code, I'm still having the following problems:
1) There will be a memory leak, if the callback has already started and finished and cleared the timer handler.
2) After TimerStop returns, there is no way to know if the callback was already running or not. That is if the timer was running, the handle was cleaned, stop returns but one can not tell if the callback will run, is still running, has finished, or actually was canceled and will never happen.
3) As a result of 2, one can never know when he can delete the timer. Deleting the timer means destroying the spinlock at the end of the callback but one has lost all tracking of the callback.

Please note that all this issues are solved by the one threaded timer. They can also be solved by the more complex timers but it is far from being simple.

Thanks
Tzachi 

> -----Original Message-----
> From: Hefty, Sean [mailto:sean.hefty at intel.com]
> Sent: Monday, June 28, 2010 8:25 AM
> To: Tzachi Dar; Smith, Stan; Fab Tillier; ofw at lists.openfabrics.org
> Cc: Uri Habusha; Yevgeny Kliteynik
> Subject: RE: [PATCH] complib/user: fix timer race conditions
> 
> > How can I know how many callbacks are still there? That is, if I call
> start
> > once, there will be only one callback.
> > On some reasonable model, I'll not call start again until the
> callback
> > happen.
> > But if we want to allow to call start before the callback returns
> than I
> > can't tell if I'll have one more call back, or two more.
> 
> This is an issue with the caller, and not something that the
> abstraction should really be dealing it.  For example, if a timer
> callback is in progress and the user calls start, there's no way for
> the abstraction to know how many callbacks the user will receive.  The
> callback in progress could just be *entering* the user's code, or just
> be *exiting* the code.  All the abstraction can do is guarantee that
> the user will receive at least one callback.
> 
> There's a point where the app needs to use separate timers if they want
> to count callbacks.  In reality, calling start_timer on a single timer
> multiple times just doesn't make any sense.
> 
> > This thing is even worse since stop should wait for all callbacks to
> > return.
> > Obviously in the current code, there is a chance that if the callback
> has
> > started but not finished the function
> > DeleteTimerQueueTimer( p_timer->h_timer_queue, timer,
> INVALID_HANDLE_VALUE
> > );
> >
> > Will never be called.
> >
> > Please note that even if it will be called, it will only be called
> for the
> > last callback and not for other callbacks that might be running.
> 
> The calling of DeleteTimerQueueTimer is simply an optimization.  Things
> should work fine without it.  The user will just get additional timer
> callback(s).  In fact, if we remove it, then the user will get one
> callback for every call to timer_start and timer_trim.
> 
> > I don't see how the second problem can be fixed here. That is, if one
> calls
> > start 50 times, and 50 callbacks have started but not finished, there
> is no
> > way to say that when they are finished...
> 
> The OS is tracking the timers.  We only care that they are finished
> when we're trying to destroy the timer queue, which is why destroy
> blocks in DeleteTimerQueueEx until all outstanding callbacks have
> completed.
> 
> - Sean