[ofw] patch: Fix a race in the cl_timer code that caused deadlocks in opensm
Hefty, Sean
sean.hefty at intel.com
Fri Jul 2 08:32:53 PDT 2010
> KeAcquireGuardedMutex - can not be acquired at dispatch level, in other
> words, this means that it can not guard the callback which is in DPC. We
> probably need another spinlock.
The kernel patch was changed to use a spinlock and KeAcquireInStackQueuedSpinLockAtDpcLevel() and KeAcquireInStackQueuedSpinLock().
> Is the patch to umad.cpp realy related here? (to the opensm issue)
The kernel timer code controls MAD retries, so it will affect umad and opensm, along with the CM and anything else that tries to send a MAD.
> In the function __destroying_ioc_pnp we should probably call keflushdpcs()
> in order to make sure that the timer was really stopped. Maybe we should
> put that in the stop to make it generic?
KeFlushQueuedDpcs is in stop timer. Personally, I don't care about ibal's pnp goop.
> It also seems to me that there might be still races in the synchronization
> of the timer call back. Maybe in the timer-callbac, if timeout is 0 the
> callback should not run? I must say that I still need to think about it.
This is a fairly narrow race. The timeout is only set to 0 in stop or in the callback. First, stop is never used. Second, whether the callback starts before stop can set the timeout to 0 is arbitrary. If we cared, the kernel locking would need to look more like the userspace locking. As it is, the user and kernel abstraction simply have different behavior. In the kernel, calling stop will cancel the timer, but any active callback will still run. Personally, I don't see an issue with this, and it matches what the kernel provides. We really only care that the callback won't run after calling destroy, which is handled.
- Sean
More information about the ofw
mailing list