[ofw] patch: Fix a race in the cl_timer code that caused deadlocks in opensm

Hefty, Sean sean.hefty at intel.com
Thu Jun 24 15:19:01 PDT 2010


> I'm performing experiments to find sailent points of interest, not looking
> for a solution by moving code....
> Is it GetThreadId() inside of the lock?
> Is it the thread_id = 0 ?
> What's magic?

thread_id is being used in an attempt to detect if a call into cl_timer_start and/or cl_timer_stop are being made from the timer callback.  It does this to know if it can block waiting for the timer callback thread to complete.  If we're calling from the timer callback thread, we can't block.

Currently the cl_timer callback routine can be invoked multiple times at once.  If thread_id is set outside of a lock then:

thread 1 - sets thread_id = 1
thread 2 - overwrites thread_id = 2
thread 1 grabs the lock and invokes the user's callback
thread 1 calls cl_timer_stop, which compares GetThreadId() with thread_id
	   the two do not match, so the code falsely assumes that we are not
	   in a callback

To prevent this, thread_id must be set under the same lock used to serialize the callback.

The second issue is whether thread_id needs to be set to 0.  If thread_id is NOT reset to 0:

thread 1 - sets thread_id = 1 for timer X
thread 1 invokes the user's callback and returns

thread 1 is part of a system thread pool.  If that thread is used for some other purpose, for example, to process some other timer Y:

thread 1 is handling timer Y
thread 1 calls cl_timer_stop for timer X
thread 1 is NOT in the callback for timer X, but cl_timer_stop assumes that it is

To prevent this, thread_id must be reset to 0, and it needs to be under the same lock as above to prevent overwriting some other thread setting it.

I'm working on a patch to make the cl_timer code simpler based on my previous email.



More information about the ofw mailing list