[ofw] patch: Fix a race in the cl_timer code that caused deadlocks in opensm

Tzachi Dar tzachid at mellanox.co.il
Thu Jun 24 15:24:49 PDT 2010


2 quick comments:
I apologize for not reading thoroughly all the previous mails...

1) On opensm, there is a call to timer start from the timer callback as well as from the timer code => multiple calls to timer start.
2) It seems to me that the implementation that we are using is very complex since callbacks can be called by multiple threads. If we limit the callbacks to a single thread, life is much easier, and as far as opensm is interested it has not performance penalty, since opensm is using one lock for working with the timer.

So, I believe that creating the code from scratch (with our thread) will make our lives much simpler.

Thanks
Tzachi



> -----Original Message-----
> From: Hefty, Sean [mailto:sean.hefty at intel.com]
> Sent: Friday, June 25, 2010 1:19 AM
> To: Smith, Stan; Fab Tillier; Tzachi Dar; ofw at lists.openfabrics.org
> Cc: Uri Habusha; Yevgeny Kliteynik
> Subject: RE: [ofw] patch: Fix a race in the cl_timer code that caused
> deadlocks in opensm
> 
> > I'm performing experiments to find sailent points of interest, not
> looking
> > for a solution by moving code....
> > Is it GetThreadId() inside of the lock?
> > Is it the thread_id = 0 ?
> > What's magic?
> 
> thread_id is being used in an attempt to detect if a call into
> cl_timer_start and/or cl_timer_stop are being made from the timer
> callback.  It does this to know if it can block waiting for the timer
> callback thread to complete.  If we're calling from the timer callback
> thread, we can't block.
> 
> Currently the cl_timer callback routine can be invoked multiple times
> at once.  If thread_id is set outside of a lock then:
> 
> thread 1 - sets thread_id = 1
> thread 2 - overwrites thread_id = 2
> thread 1 grabs the lock and invokes the user's callback
> thread 1 calls cl_timer_stop, which compares GetThreadId() with
> thread_id
> 	   the two do not match, so the code falsely assumes that we are
> not
> 	   in a callback
> 
> To prevent this, thread_id must be set under the same lock used to
> serialize the callback.
> 
> The second issue is whether thread_id needs to be set to 0.  If
> thread_id is NOT reset to 0:
> 
> thread 1 - sets thread_id = 1 for timer X
> thread 1 invokes the user's callback and returns
> 
> thread 1 is part of a system thread pool.  If that thread is used for
> some other purpose, for example, to process some other timer Y:
> 
> thread 1 is handling timer Y
> thread 1 calls cl_timer_stop for timer X
> thread 1 is NOT in the callback for timer X, but cl_timer_stop assumes
> that it is
> 
> To prevent this, thread_id must be reset to 0, and it needs to be under
> the same lock as above to prevent overwriting some other thread setting
> it.
> 
> I'm working on a patch to make the cl_timer code simpler based on my
> previous email.



More information about the ofw mailing list