[ofw] patch: Fix a race in the cl_timer code that caused deadlocks in opensm

Hefty, Sean sean.hefty at intel.com
Thu Jun 24 11:28:39 PDT 2010


> The above code sequence in user mode cl_timer.c fails DHCP address
> assignment upon compute node reboot.
> HCA ports are 'ACTIVE' but no DHCP assignment. Kernel cl_timer V2 patches
> installed.
> 
> If you go back to Tzachi's patch, then you get DHCP address assignment
> correctly.
> thread_id = GetThreadId();
> lock cb_serialize
> callback()
> unlock cb_serialize
> 
> Currently building/testing without the lock/unlock cb_serialize.
> 
> Will also test with
> 
> thread_id = GetThreadId();
> lock cb_serialize
> callback()
> thread_id = 0
> unlock cb_serialize
> 
> stay tuned.

We are likely hitting another issue here.  If thread_id is not set under the cb_serialize lock, or is not reset to 0 before the lock is released, then the check in cl_timer_stop will not work reliably.  Both code segments above are racy.  Moving code around until some test case passes isn't the approach we should be using; the race conditions we're dealing with aren't going to be easy to reproduce.
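
To make the ordering concrete, here is a rough sketch of what I mean.  This is not the actual cl_timer.c code: the sketch_timer_t type, sketch_timer_fire, sketch_timer_stop, current_thread_id, and demo_callback are all made-up names, pthread mutexes and C11 atomics stand in for the Windows primitives, and canceling the pending timer is left out entirely.  It only shows thread_id being set and cleared while cb_serialize is held, and the stop path checking it before waiting on the lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified timer; not the real cl_timer_t layout. */
typedef struct sketch_timer {
    pthread_mutex_t  cb_serialize; /* held for the duration of the callback */
    atomic_uintptr_t thread_id;    /* 0 whenever no callback is running */
    void           (*callback)(void *);
    void            *context;
} sketch_timer_t;

/* Assumption: the address of a thread-local object serves as a nonzero,
 * per-thread identifier (stand-in for a Windows thread id). */
static _Thread_local char self_marker;

static uintptr_t current_thread_id(void)
{
    return (uintptr_t)&self_marker;
}

void sketch_timer_init(sketch_timer_t *t, void (*cb)(void *), void *ctx)
{
    assert(cb != NULL);
    pthread_mutex_init(&t->cb_serialize, NULL);
    atomic_init(&t->thread_id, 0);
    t->callback = cb;
    t->context = ctx;
}

/* Expiry path: thread_id is written only while cb_serialize is held,
 * and cleared again before the lock is dropped. */
void sketch_timer_fire(sketch_timer_t *t)
{
    pthread_mutex_lock(&t->cb_serialize);
    atomic_store(&t->thread_id, current_thread_id());
    t->callback(t->context);
    atomic_store(&t->thread_id, 0);
    pthread_mutex_unlock(&t->cb_serialize);
}

/* Stop path (timer cancelation omitted).  Returns 1 if called from inside
 * the callback, where taking cb_serialize would self-deadlock; otherwise
 * waits for any in-flight callback to finish and returns 0. */
int sketch_timer_stop(sketch_timer_t *t)
{
    if (atomic_load(&t->thread_id) == current_thread_id())
        return 1;

    pthread_mutex_lock(&t->cb_serialize);
    pthread_mutex_unlock(&t->cb_serialize);
    return 0;
}

/* Example callback that stops its own timer and records what stop saw. */
int demo_stop_result = -1;

void demo_callback(void *ctx)
{
    demo_stop_result = sketch_timer_stop((sketch_timer_t *)ctx);
}
```

Without the reset to 0, a later stop call from the thread that last ran the callback would match the stale thread_id and skip the wait.  Without setting it under the lock, thread_id can name a thread that is still blocked on cb_serialize rather than the one actually running the callback.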

Tzachi successfully identified races in cl_timer.  We need to fix those, and if the fallout is that other bugs are more easily exposed, with consistent failures, then that's a good thing.

- Sean