<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:p="urn:schemas-microsoft-com:office:powerpoint" xmlns:a="urn:schemas-microsoft-com:office:access" xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" xmlns:s="uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882" xmlns:rs="urn:schemas-microsoft-com:rowset" xmlns:z="#RowsetSchema" xmlns:b="urn:schemas-microsoft-com:office:publisher" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:c="urn:schemas-microsoft-com:office:component:spreadsheet" xmlns:odc="urn:schemas-microsoft-com:office:odc" xmlns:oa="urn:schemas-microsoft-com:office:activation" xmlns:html="http://www.w3.org/TR/REC-html40" xmlns:q="http://schemas.xmlsoap.org/soap/envelope/" xmlns:rtc="http://microsoft.com/officenet/conferencing" xmlns:D="DAV:" xmlns:Repl="http://schemas.microsoft.com/repl/" xmlns:mt="http://schemas.microsoft.com/sharepoint/soap/meetings/" xmlns:x2="http://schemas.microsoft.com/office/excel/2003/xml" xmlns:ppda="http://www.passport.com/NameSpace.xsd" xmlns:ois="http://schemas.microsoft.com/sharepoint/soap/ois/" xmlns:dir="http://schemas.microsoft.com/sharepoint/soap/directory/" xmlns:ds="http://www.w3.org/2000/09/xmldsig#" xmlns:dsp="http://schemas.microsoft.com/sharepoint/dsp" xmlns:udc="http://schemas.microsoft.com/data/udc" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:sub="http://schemas.microsoft.com/sharepoint/soap/2002/1/alerts/" xmlns:ec="http://www.w3.org/2001/04/xmlenc#" xmlns:sp="http://schemas.microsoft.com/sharepoint/" xmlns:sps="http://schemas.microsoft.com/sharepoint/soap/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:udcs="http://schemas.microsoft.com/data/udc/soap" xmlns:udcxf="http://schemas.microsoft.com/data/udc/xmlfile" xmlns:udcp2p="http://schemas.microsoft.com/data/udc/parttopart" xmlns:wf="http://schemas.microsoft.com/sharepoint/soap/workflow/" xmlns:dsss="http://schemas.microsoft.com/office/2006/digsig-setup" xmlns:dssi="http://schemas.microsoft.com/office/2006/digsig" xmlns:mdssi="http://schemas.openxmlformats.org/package/2006/digital-signature" xmlns:mver="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns:mrels="http://schemas.openxmlformats.org/package/2006/relationships" xmlns:spwp="http://microsoft.com/sharepoint/webpartpages" xmlns:ex12t="http://schemas.microsoft.com/exchange/services/2006/types" xmlns:ex12m="http://schemas.microsoft.com/exchange/services/2006/messages" xmlns:pptsl="http://schemas.microsoft.com/sharepoint/soap/SlideLibrary/" xmlns:spsl="http://microsoft.com/webservices/SharePointPortalServer/PublishedLinksService" xmlns:Z="urn:schemas-microsoft-com:" xmlns:st="" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv=Content-Type content="text/html; charset=us-ascii">
<meta name=Generator content="Microsoft Word 12 (filtered medium)">
<!--[if !mso]>
<style>
v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style>
<![endif]-->
<style>
<!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:Tahoma;
panose-1:2 11 6 4 3 5 4 4 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri","sans-serif";}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:purple;
text-decoration:underline;}
span.EmailStyle17
{mso-style-type:personal;
font-family:"Calibri","sans-serif";
color:windowtext;}
span.EmailStyle18
{mso-style-type:personal;
font-family:"Calibri","sans-serif";
color:#1F497D;}
span.EmailStyle19
{mso-style-type:personal;
font-family:"Calibri","sans-serif";
color:#1F497D;}
span.EmailStyle20
{mso-style-type:personal-reply;
font-family:"Calibri","sans-serif";
color:#1F497D;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
-->
</style>
<!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang=EN-US link=blue vlink=purple>
<div class=WordSection1>
<p class=MsoNormal><span style='color:#1F497D'>Because the kernel timer uses
DPCs, the only time it could run in parallel is if it is requeued with a short
timeout from the timer callback itself. If client code restarts the timer
as one of the last operations in the timer callback, there should really be no
issue. So I don’t think we need to guard the callbacks like we had
to in the user-mode timer implementation.<o:p></o:p></span></p>
<p class=MsoNormal><span style='color:#1F497D'><o:p> </o:p></span></p>
<p class=MsoNormal><span style='color:#1F497D'>Now, as to stop, we can use the
return value of KeCancelTimer to control whether we should call KeFlushQueuedDpcs.
If KeCancelTimer returns TRUE, the timer was cancelled and there’s no
need to flush DPCs. A better solution would be for the DPC to signal an
event that cl_timer_stop could block on (same IRQL requirements as
KeFlushQueuedDpcs.)<o:p></o:p></span></p>
<p class=MsoNormal><span style='color:#1F497D'><o:p> </o:p></span></p>
<p class=MsoNormal><span style='color:#1F497D'>-Fab<o:p></o:p></span></p>
<p class=MsoNormal><span style='color:#1F497D'><o:p> </o:p></span></p>
<div>
<div style='border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in'>
<p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'>From:</span></b><span
style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'> Tzachi Dar
[mailto:tzachid@mellanox.co.il] <br>
<b>Sent:</b> Friday, July 02, 2010 3:02 AM<br>
<b>To:</b> Smith, Stan; ofw@lists.openfabrics.org<br>
<b>Cc:</b> Yevgeny Kliteynik; Uri Habusha; Fab Tillier<br>
<b>Subject:</b> RE: [ofw] patch: Fix a race in the cl_timer code that caused
deadlocks in opensm<o:p></o:p></span></p>
</div>
</div>
<p class=MsoNormal><o:p> </o:p></p>
<p class=MsoNormal><span style='color:#1F497D'>Now that we are through
with the user mode timer, I have a few concerns about the kernel one.<o:p></o:p></span></p>
<p class=MsoNormal><span style='color:#1F497D'><o:p> </o:p></span></p>
<p class=MsoNormal><span style='color:#1F497D'><o:p> </o:p></span></p>
<p class=MsoNormal><span style='color:#1F497D'>KeAcquireGuardedMutex –
can not be acquired at dispatch level, in other words, this means that it can not
guard the callback which is in DPC. We probably need another spinlock.<o:p></o:p></span></p>
<p class=MsoNormal><span style='color:#1F497D'><o:p> </o:p></span></p>
<p class=MsoNormal><span style='color:#1F497D'>Is the patch to </span><span
style='font-size:10.0pt;font-family:"Arial","sans-serif";color:blue'>umad.cpp
realy related here? (to the opensm issue)</span><span style='color:#1F497D'><o:p></o:p></span></p>
<p class=MsoNormal><span style='color:#1F497D'>In the function
__destroying_ioc_pnp we should probably call keflushdpcs() in order to make
sure that the timer was really stopped. Maybe we should put that in the stop to
make it generic?<o:p></o:p></span></p>
<p class=MsoNormal><span style='color:#1F497D'><o:p> </o:p></span></p>
<p class=MsoNormal><span style='color:#1F497D'>It also seems to me that there
might be still races in the synchronization of the timer call back. Maybe in
the timer-callbac, if timeout is 0 the callback should not run? I must say that
I still need to think about it.<o:p></o:p></span></p>
<p class=MsoNormal><span style='color:#1F497D'><o:p> </o:p></span></p>
<p class=MsoNormal><span style='color:#1F497D'>From our experience, is the
kernel mode fix relay needed for the opensm issue to work?<o:p></o:p></span></p>
<p class=MsoNormal><span style='color:#1F497D'><o:p> </o:p></span></p>
<p class=MsoNormal><span style='color:#1F497D'>Thanks<o:p></o:p></span></p>
<p class=MsoNormal><span style='color:#1F497D'>Tzachi<o:p></o:p></span></p>
<p class=MsoNormal><span style='color:#1F497D'><o:p> </o:p></span></p>
<p class=MsoNormal><span style='color:#1F497D'><o:p> </o:p></span></p>
<p class=MsoNormal><span style='color:#1F497D'><o:p> </o:p></span></p>
<div style='border:none;border-left:solid blue 1.5pt;padding:0in 0in 0in 4.0pt'>
<div>
<div style='border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in'>
<p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'>From:</span></b><span
style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'> Smith, Stan
[mailto:stan.smith@intel.com] <br>
<b>Sent:</b> Wednesday, June 23, 2010 9:47 PM<br>
<b>To:</b> Tzachi Dar; ofw@lists.openfabrics.org<br>
<b>Cc:</b> Yevgeny Kliteynik; Uri Habusha; Fab Tillier<br>
<b>Subject:</b> RE: [ofw] patch: Fix a race in the cl_timer code that caused
deadlocks in opensm<o:p></o:p></span></p>
</div>
</div>
<p class=MsoNormal><o:p> </o:p></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";
color:blue'>Separated patch files so cut-n-paste into a single patch file did
not introduce any problems.</span><span style='font-size:12.0pt;font-family:
"Times New Roman","serif"'><o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";
color:blue'>From svn trunk\</span><span style='font-size:12.0pt;font-family:
"Times New Roman","serif"'><o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";
color:blue'>Apply your timer patch as patch -p1 < timer.patch</span><span
style='font-size:12.0pt;font-family:"Times New Roman","serif"'><o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";
color:blue'>Apply:</span><span style='font-size:12.0pt;font-family:"Times New Roman","serif"'><o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";
color:blue'> patch -p1 < cl_timer_osd.h.patch
user-mode serialize</span><span style='font-size:12.0pt;font-family:"Times New Roman","serif"'><o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";
color:blue'> patch -p1 <
cl_timer.c.patch
user-mode serialize</span><span style='font-size:12.0pt;font-family:"Times New Roman","serif"'><o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";
color:blue'> patch -p2 <
kernel_timer.patch sean's kernel patches</span><span
style='font-size:12.0pt;font-family:"Times New Roman","serif"'><o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:12.0pt;font-family:"Times New Roman","serif"'> <o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";
color:blue'> if using opensm 3.3.6 (vnedor-umad) patch -p1 <
umad.cpp.patch</span><span style='font-size:12.0pt;font-family:"Times New Roman","serif"'><o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:12.0pt;font-family:"Times New Roman","serif"'> <o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:12.0pt;font-family:"Times New Roman","serif"'><o:p> </o:p></span></p>
<div class=MsoNormal align=center style='text-align:center'><span
style='font-size:12.0pt;font-family:"Times New Roman","serif"'>
<hr size=2 width="100%" align=center>
</span></div>
<p class=MsoNormal style='margin-bottom:12.0pt'><b><span style='font-size:10.0pt;
font-family:"Tahoma","sans-serif"'>From:</span></b><span style='font-size:10.0pt;
font-family:"Tahoma","sans-serif"'> Tzachi Dar [mailto:tzachid@mellanox.co.il] <br>
<b>Sent:</b> Wednesday, June 23, 2010 8:10 AM<br>
<b>To:</b> Smith, Stan; ofw@lists.openfabrics.org<br>
<b>Cc:</b> Yevgeny Kliteynik; Uri Habusha; Fab Tillier<br>
<b>Subject:</b> RE: [ofw] patch: Fix a race in the cl_timer code that caused
deadlocks in opensm</span><span style='font-size:12.0pt;font-family:"Times New Roman","serif"'><o:p></o:p></span></p>
<p class=MsoNormal><span style='color:#1F497D'>Can you please send one patch,
the way that you see it? (user + kernel)?<o:p></o:p></span></p>
<p class=MsoNormal><span style='color:#1F497D'><o:p> </o:p></span></p>
<p class=MsoNormal><span style='color:#1F497D'>Thanks<o:p></o:p></span></p>
<p class=MsoNormal><span style='color:#1F497D'>Tzachi<o:p></o:p></span></p>
<p class=MsoNormal><span style='color:#1F497D'><o:p> </o:p></span></p>
<div style='border:none;border-left:solid blue 1.5pt;padding:0in 0in 0in 4.0pt'>
<div>
<div style='border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in'>
<p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'>From:</span></b><span
style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'> Smith, Stan
[mailto:stan.smith@intel.com] <br>
<b>Sent:</b> Wednesday, June 23, 2010 5:16 PM<br>
<b>To:</b> Tzachi Dar; ofw@lists.openfabrics.org<br>
<b>Cc:</b> Yevgeny Kliteynik; Uri Habusha; Fab Tillier<br>
<b>Subject:</b> RE: [ofw] patch: Fix a race in the cl_timer code that caused
deadlocks in opensm<o:p></o:p></span></p>
</div>
</div>
<p class=MsoNormal><o:p> </o:p></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";
color:blue'>Hello,</span><span style='font-size:12.0pt;font-family:"Times New Roman","serif"'><o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";
color:blue'> In testing on our 52 node HPC Edition cluster using opensm
3.3.6 (vendor-umad) I found your patch by itself insufficient to resolve
the lost MAD problem which occurs when rebooting many compute nodes (DHCP
assigned IPoIB IPv4 address) simultaneously.</span><span style='font-size:12.0pt;
font-family:"Times New Roman","serif"'><o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";
color:blue'>Your patch by itself did work at 39 nodes, although failed
consistently at 52 nodes.</span><span style='font-size:12.0pt;font-family:"Times New Roman","serif"'><o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";
color:blue'>Only after applying your patch plus Sean's kernel cl_timer
patch and my own cl_timer serialization patch can 52 nodes reboot
simultaneously and reach full IPoIB operational status (no lost MADs, all ports
active and all compute nodes get DHCP address assigned correctly).</span><span
style='font-size:12.0pt;font-family:"Times New Roman","serif"'><o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";
color:blue'>Will be testing opensm 3.3.6 (vendor-ibal) later today.</span><span
style='font-size:12.0pt;font-family:"Times New Roman","serif"'><o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:12.0pt;font-family:"Times New Roman","serif"'> <o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";
color:blue'>BTW, the system installs, uninstalls performs without noticeable
delays. Will do some performance comparisons later today.</span><span
style='font-size:12.0pt;font-family:"Times New Roman","serif"'><o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:12.0pt;font-family:"Times New Roman","serif"'> <o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";
color:blue'>I suspect the system 'might' get by without the user-mode
serialization patch although it's there as the Linux cl_timer callbacks are
serialized.</span><span style='font-size:12.0pt;font-family:"Times New Roman","serif"'><o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";
color:blue'>Will perform experiments in the near future.</span><span
style='font-size:12.0pt;font-family:"Times New Roman","serif"'><o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:12.0pt;font-family:"Times New Roman","serif"'> <o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";
color:blue'>stan.</span><span style='font-size:12.0pt;font-family:"Times New Roman","serif"'><o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:12.0pt;font-family:"Times New Roman","serif"'> <o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";
color:blue'>PS: Sorry about the diffs as my patch is integrated with your
original patch. Important part is the lock acquire before invoking callback()
then lock release.</span><span style='font-size:12.0pt;font-family:"Times New Roman","serif"'><o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:12.0pt;font-family:"Times New Roman","serif"'> <o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif";
color:blue'>--- a/inc/user/complib/cl_timer_osd.h Wed Jun 23 06:48:25 2010<br>
+++ b/inc/user/complib/cl_timer_osd.h Wed Jun 23 06:48:46 2010<br>
@@ -47,6 +47,8 @@<br>
const void *context;<br>
uint64_t timeout_time;<br>
DWORD thread_id;<br>
+ cl_spinlock_t spinlock;<br>
+ </span><span style='font-size:10.0pt;font-family:"Arial","sans-serif";
color:red'>cl_spinlock_t cb_serialize;<br>
</span><span style='font-size:10.0pt;font-family:"Arial","sans-serif";
color:blue'> <br>
} cl_timer_t;<br>
<br>
--- b/core/complib/user/cl_timer.c Wed Jun 23 06:50:14 2010<br>
+++ b/core/complib/user/cl_timer.c Tue Jun 22 15:02:02 2010<br>
@@ -33,30 +33,34 @@<br>
<br>
#include "complib/cl_timer.h"<br>
<br>
-<br>
static void CALLBACK<br>
__timer_callback( <br>
IN cl_timer_t* const p_timer,<br>
IN BOOLEAN timer_signalled )<br>
{<br>
/* timer_signalled is always TRUE, and has no value. */<br>
+ <br>
+ DWORD thread_id = GetCurrentThreadId();<br>
CL_ASSERT( timer_signalled );<br>
+<br>
UNUSED_PARAM( timer_signalled );<br>
<br>
+ CL_ASSERT(thread_id!=0);<br>
+<br>
p_timer->timeout_time = 0;<br>
- p_timer->thread_id = GetCurrentThreadId();<br>
+ p_timer->thread_id = thread_id;<br>
<br>
+ </span><span style='font-size:10.0pt;font-family:"Arial","sans-serif";
color:red'>cl_spinlock_acquire(&p_timer->cb_serialize);</span><span
style='font-size:10.0pt;font-family:"Arial","sans-serif";color:blue'><br>
(p_timer->pfn_callback)( (void*)p_timer->context );<br>
-<br>
- p_timer->thread_id = 0;<br>
+ </span><span style='font-size:10.0pt;font-family:"Arial","sans-serif";
color:red'>cl_spinlock_release(&p_timer->cb_serialize);</span><span
style='font-size:10.0pt;font-family:"Arial","sans-serif";color:blue'><br>
}<br>
<br>
-<br>
void<br>
cl_timer_construct(<br>
IN cl_timer_t* const p_timer )<br>
{<br>
p_timer->h_timer = NULL;<br>
+ </span><span style='font-size:10.0pt;font-family:"Arial","sans-serif";
color:red'>p_timer->pfn_callback = NULL;<br>
</span><span style='font-size:10.0pt;font-family:"Arial","sans-serif";
color:blue'> p_timer->timeout_time = 0;<br>
p_timer->thread_id = 0;<br>
}<br>
@@ -75,8 +79,9 @@<br>
cl_timer_construct( p_timer );<br>
<br>
p_timer->pfn_callback = pfn_callback;<br>
+ </span><span style='font-size:10.0pt;font-family:"Arial","sans-serif";
color:red'>cl_spinlock_init(&p_timer->cb_serialize);</span><span
style='font-size:10.0pt;font-family:"Arial","sans-serif";color:blue'><br>
p_timer->context = context;<br>
- return( CL_SUCCESS );<br>
+ return cl_spinlock_init(&p_timer->spinlock);<br>
}<br>
<br>
<br>
@@ -87,6 +92,12 @@<br>
CL_ASSERT( p_timer );<br>
<br>
cl_timer_stop( p_timer );<br>
+ </span><span style='font-size:10.0pt;font-family:"Arial","sans-serif";
color:red'>if ( p_timer->pfn_callback )<br>
+ {<br>
+ /* construct does not set these, _init does */<br>
+ cl_spinlock_destroy( &p_timer->spinlock );<br>
+ cl_spinlock_destroy( &p_timer->cb_serialize );<br>
+ }<br>
</span><span style='font-size:10.0pt;font-family:"Arial","sans-serif";
color:blue'> }<br>
<br>
<br>
@@ -97,6 +108,8 @@<br>
{<br>
CL_ASSERT( p_timer );<br>
<br>
+ cl_spinlock_acquire(&p_timer->spinlock);<br>
+ <br>
cl_timer_stop( p_timer );<br>
<br>
p_timer->timeout_time = cl_get_time_stamp() +
(((uint64_t)time_ms) * 1000);<br>
@@ -104,9 +117,12 @@<br>
if( !CreateTimerQueueTimer( &p_timer->h_timer, NULL,
__timer_callback,<br>
p_timer, time_ms, 0, WT_EXECUTEINIOTHREAD ) )<br>
{<br>
+ <br>
+ cl_spinlock_release(&p_timer->spinlock);<br>
return( CL_ERROR );<br>
}<br>
<br>
+ cl_spinlock_release(&p_timer->spinlock);<br>
return( CL_SUCCESS );<br>
}<br>
<br>
@@ -117,26 +133,34 @@<br>
IN const uint32_t time_ms )<br>
{<br>
uint64_t timeout_time;<br>
+ cl_status_t status;<br>
<br>
CL_ASSERT( p_timer );<br>
CL_ASSERT( p_timer->pfn_callback );<br>
<br>
+ cl_spinlock_acquire(&p_timer->spinlock);<br>
+<br>
/* Calculate the timeout time in the timer object. */<br>
timeout_time = cl_get_time_stamp() + (((uint64_t)time_ms) * 1000);<br>
<br>
/* Only pull in the timeout time. */<br>
- if( p_timer->timeout_time && p_timer->timeout_time <
timeout_time )<br>
+ if( p_timer->timeout_time && p_timer->timeout_time <
timeout_time ) {<br>
+ cl_spinlock_release(&p_timer->spinlock);<br>
return( CL_SUCCESS );<br>
+ }<br>
<br>
- return cl_timer_start( p_timer, time_ms );<br>
+ status = cl_timer_start( p_timer, time_ms );<br>
+ <br>
+ cl_spinlock_release(&p_timer->spinlock);<br>
+ return status;<br>
}<br>
<br>
-<br>
void<br>
cl_timer_stop(<br>
IN cl_timer_t* const p_timer )<br>
{<br>
CL_ASSERT( p_timer );<br>
+ cl_spinlock_acquire(&p_timer->spinlock);<br>
<br>
if( p_timer->h_timer && p_timer->thread_id !=
GetCurrentThreadId() )<br>
{<br>
@@ -145,6 +169,8 @@<br>
p_timer->h_timer = NULL;<br>
}<br>
p_timer->timeout_time = 0;<br>
+ <br>
+ cl_spinlock_release(&p_timer->spinlock);<br>
}<br>
<br>
</span><span style='font-size:12.0pt;font-family:"Times New Roman","serif"'><o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:12.0pt;font-family:"Times New Roman","serif"'><o:p> </o:p></span></p>
<div class=MsoNormal align=center style='text-align:center'><span
style='font-size:12.0pt;font-family:"Times New Roman","serif"'>
<hr size=2 width="100%" align=center>
</span></div>
<p class=MsoNormal style='margin-bottom:12.0pt'><b><span style='font-size:10.0pt;
font-family:"Tahoma","sans-serif"'>From:</span></b><span style='font-size:10.0pt;
font-family:"Tahoma","sans-serif"'> ofw-bounces@lists.openfabrics.org
[mailto:ofw-bounces@lists.openfabrics.org] <b>On Behalf Of </b>Tzachi Dar<br>
<b>Sent:</b> Tuesday, June 22, 2010 12:51 AM<br>
<b>To:</b> ofw@lists.openfabrics.org<br>
<b>Cc:</b> Yevgeny Kliteynik<br>
<b>Subject:</b> [ofw] patch: Fix a race in the cl_timer code that caused
deadlocks in opensm</span><span style='font-size:12.0pt;font-family:"Times New Roman","serif"'><o:p></o:p></span></p>
<p class=MsoNormal>While debugging a deadlock in the opensm executable, we have
found that there is a problem in the cl_timer code:<o:p></o:p></p>
<p class=MsoNormal><o:p> </o:p></p>
<p class=MsoNormal>The problem starts from the fact that there can be two call backs
that are running simultaneously.<o:p></o:p></p>
<p class=MsoNormal>On each call back that runs there is the following code: <o:p></o:p></p>
<p class=MsoNormal><o:p> </o:p></p>
<p class=MsoNormal>static void CALLBACK<o:p></o:p></p>
<p class=MsoNormal>__timer_callback( <o:p></o:p></p>
<p class=MsoNormal>
IN cl_timer_t* const p_timer,<o:p></o:p></p>
<p class=MsoNormal>
IN BOOLEAN timer_signalled )<o:p></o:p></p>
<p class=MsoNormal>{<o:p></o:p></p>
<p class=MsoNormal>
/* timer_signalled is always TRUE, and has no value. */<o:p></o:p></p>
<p class=MsoNormal>
CL_ASSERT( timer_signalled );<o:p></o:p></p>
<p class=MsoNormal>
UNUSED_PARAM( timer_signalled );<o:p></o:p></p>
<p class=MsoNormal><o:p> </o:p></p>
<p class=MsoNormal>
p_timer->timeout_time = 0;<o:p></o:p></p>
<p class=MsoNormal>
<span style='color:red'>p_timer->thread_id = GetCurrentThreadId();<o:p></o:p></span></p>
<p class=MsoNormal><o:p> </o:p></p>
<p class=MsoNormal>
(p_timer->pfn_callback)( (void*)p_timer->context );<o:p></o:p></p>
<p class=MsoNormal><o:p> </o:p></p>
<p class=MsoNormal>
<span style='color:red'>p_timer->thread_id = 0;<o:p></o:p></span></p>
<p class=MsoNormal>}<o:p></o:p></p>
<p class=MsoNormal><o:p> </o:p></p>
<p class=MsoNormal>This should promise that if a callback is running, <span
style='color:red'>p_timer->thread_id </span>will have the
thread_id of the running thread.<o:p></o:p></p>
<p class=MsoNormal>This field is later used when cl_timer_stop() is being
called in order to prevent deadlocks. (please note that cl_timer_stop() is
being called from cl_timer_start(). Please also note that the deadlock happens
if the caller of cl_stop() is actually executing in the context of the call
back itself).<o:p></o:p></p>
<p class=MsoNormal><o:p> </o:p></p>
<p class=MsoNormal>The problem happens when two callbacks will be running and
than the <span style='color:red'>p_timer->thread_id</span> doesn’t
show the correct field.<o:p></o:p></p>
<p class=MsoNormal><o:p> </o:p></p>
<p class=MsoNormal>The fix itself is to sync the start and stop calls and now
only one callback will be running (actually, there is one exception to this
code: if from the callback, a new timer is started, the new callback might be
running on the same time, but: (1) stop has already being called. (2) the <span
style='color:red'>p_timer->thread_id</span> is not being touched after the
callback.<o:p></o:p></p>
<p class=MsoNormal><o:p> </o:p></p>
<p class=MsoNormal>Thanks<o:p></o:p></p>
<p class=MsoNormal>Tzachi<o:p></o:p></p>
<p class=MsoNormal><o:p> </o:p></p>
<p class=MsoNormal><o:p> </o:p></p>
<p class=MsoNormal><o:p> </o:p></p>
<p class=MsoNormal><o:p> </o:p></p>
<p class=MsoNormal>Index: B:/users/tzachid/MLNX_WinOF-2_1_1/MLNX_WinOF-2_1_1/core/complib/user/cl_timer.c<o:p></o:p></p>
<p class=MsoNormal>===================================================================<o:p></o:p></p>
<p class=MsoNormal>---
B:/users/tzachid/MLNX_WinOF-2_1_1/MLNX_WinOF-2_1_1/core/complib/user/cl_timer.c
(revision 5977)<o:p></o:p></p>
<p class=MsoNormal>+++ B:/users/tzachid/MLNX_WinOF-2_1_1/MLNX_WinOF-2_1_1/core/complib/user/cl_timer.c
(revision 5978)<o:p></o:p></p>
<p class=MsoNormal>@@ -33,25 +33,27 @@<o:p></o:p></p>
<p class=MsoNormal> <o:p></o:p></p>
<p class=MsoNormal> #include "complib/cl_timer.h"<o:p></o:p></p>
<p class=MsoNormal> <o:p></o:p></p>
<p class=MsoNormal>-<o:p></o:p></p>
<p class=MsoNormal> static void CALLBACK<o:p></o:p></p>
<p class=MsoNormal> __timer_callback( <o:p></o:p></p>
<p class=MsoNormal>
IN cl_timer_t* const p_timer,<o:p></o:p></p>
<p class=MsoNormal>
IN BOOLEAN timer_signalled )<o:p></o:p></p>
<p class=MsoNormal> {<o:p></o:p></p>
<p class=MsoNormal>
/* timer_signalled is always TRUE, and has no value. */<o:p></o:p></p>
<p class=MsoNormal>+
<o:p></o:p></p>
<p class=MsoNormal>+
DWORD thread_id = GetCurrentThreadId();<o:p></o:p></p>
<p class=MsoNormal>
CL_ASSERT( timer_signalled );<o:p></o:p></p>
<p class=MsoNormal>+<o:p></o:p></p>
<p class=MsoNormal>
UNUSED_PARAM( timer_signalled );<o:p></o:p></p>
<p class=MsoNormal> <o:p></o:p></p>
<p class=MsoNormal>+
CL_ASSERT(thread_id!=0);<o:p></o:p></p>
<p class=MsoNormal>+<o:p></o:p></p>
<p class=MsoNormal>
p_timer->timeout_time = 0;<o:p></o:p></p>
<p class=MsoNormal>-
p_timer->thread_id = GetCurrentThreadId();<o:p></o:p></p>
<p class=MsoNormal>+
p_timer->thread_id = thread_id;<o:p></o:p></p>
<p class=MsoNormal> <o:p></o:p></p>
<p class=MsoNormal>
(p_timer->pfn_callback)( (void*)p_timer->context );<o:p></o:p></p>
<p class=MsoNormal> <o:p></o:p></p>
<p class=MsoNormal>-
p_timer->thread_id = 0;<o:p></o:p></p>
<p class=MsoNormal> }<o:p></o:p></p>
<p class=MsoNormal> <o:p></o:p></p>
<p class=MsoNormal>-<o:p></o:p></p>
<p class=MsoNormal> void<o:p></o:p></p>
<p class=MsoNormal> cl_timer_construct(<o:p></o:p></p>
<p class=MsoNormal>
IN cl_timer_t*
const p_timer
)<o:p></o:p></p>
<p class=MsoNormal>@@ -76,7 +78,8 @@<o:p></o:p></p>
<p class=MsoNormal> <o:p></o:p></p>
<p class=MsoNormal>
p_timer->pfn_callback = pfn_callback;<o:p></o:p></p>
<p class=MsoNormal>
p_timer->context = context;<o:p></o:p></p>
<p class=MsoNormal>-
return( CL_SUCCESS );<o:p></o:p></p>
<p class=MsoNormal>+
return cl_spinlock_init(&p_timer->spinlock);<o:p></o:p></p>
<p class=MsoNormal>+
<o:p></o:p></p>
<p class=MsoNormal> }<o:p></o:p></p>
<p class=MsoNormal> <o:p></o:p></p>
<p class=MsoNormal> <o:p></o:p></p>
<p class=MsoNormal>@@ -97,6 +100,8 @@<o:p></o:p></p>
<p class=MsoNormal> {<o:p></o:p></p>
<p class=MsoNormal>
CL_ASSERT( p_timer );<o:p></o:p></p>
<p class=MsoNormal> <o:p></o:p></p>
<p class=MsoNormal>+
cl_spinlock_acquire(&p_timer->spinlock);<o:p></o:p></p>
<p class=MsoNormal>+
<o:p></o:p></p>
<p class=MsoNormal>
cl_timer_stop( p_timer );<o:p></o:p></p>
<p class=MsoNormal> <o:p></o:p></p>
<p class=MsoNormal>
p_timer->timeout_time = cl_get_time_stamp() + (((uint64_t)time_ms) * 1000);<o:p></o:p></p>
<p class=MsoNormal>@@ -104,9 +109,12 @@<o:p></o:p></p>
<p class=MsoNormal>
if( !CreateTimerQueueTimer( &p_timer->h_timer, NULL, __timer_callback,<o:p></o:p></p>
<p class=MsoNormal>
p_timer, time_ms, 0, WT_EXECUTEINIOTHREAD ) )<o:p></o:p></p>
<p class=MsoNormal>
{<o:p></o:p></p>
<p class=MsoNormal>+
<o:p></o:p></p>
<p class=MsoNormal>+
cl_spinlock_release(&p_timer->spinlock);<o:p></o:p></p>
<p class=MsoNormal>
return( CL_ERROR );<o:p></o:p></p>
<p class=MsoNormal>
}<o:p></o:p></p>
<p class=MsoNormal>
<o:p></o:p></p>
<p class=MsoNormal>+
cl_spinlock_release(&p_timer->spinlock);<o:p></o:p></p>
<p class=MsoNormal>
return( CL_SUCCESS );<o:p></o:p></p>
<p class=MsoNormal> }<o:p></o:p></p>
<p class=MsoNormal> <o:p></o:p></p>
<p class=MsoNormal>@@ -117,26 +125,34 @@<o:p></o:p></p>
<p class=MsoNormal>
IN const
uint32_t
time_ms )<o:p></o:p></p>
<p class=MsoNormal> {<o:p></o:p></p>
<p class=MsoNormal>
uint64_t
timeout_time;<o:p></o:p></p>
<p class=MsoNormal>+
cl_status_t
status;<o:p></o:p></p>
<p class=MsoNormal> <o:p></o:p></p>
<p class=MsoNormal>
CL_ASSERT( p_timer );<o:p></o:p></p>
<p class=MsoNormal>
CL_ASSERT( p_timer->pfn_callback );<o:p></o:p></p>
<p class=MsoNormal> <o:p></o:p></p>
<p class=MsoNormal>+
cl_spinlock_acquire(&p_timer->spinlock);<o:p></o:p></p>
<p class=MsoNormal>+<o:p></o:p></p>
<p class=MsoNormal>
/* Calculate the timeout time in the timer object. */<o:p></o:p></p>
<p class=MsoNormal>
timeout_time = cl_get_time_stamp() + (((uint64_t)time_ms) * 1000);<o:p></o:p></p>
<p class=MsoNormal> <o:p></o:p></p>
<p class=MsoNormal>
/* Only pull in the timeout time. */<o:p></o:p></p>
<p class=MsoNormal>-
if( p_timer->timeout_time && p_timer->timeout_time <
timeout_time )<o:p></o:p></p>
<p class=MsoNormal>+
if( p_timer->timeout_time && p_timer->timeout_time <
timeout_time ) {<o:p></o:p></p>
<p class=MsoNormal>+
cl_spinlock_release(&p_timer->spinlock);<o:p></o:p></p>
<p class=MsoNormal>
return( CL_SUCCESS );<o:p></o:p></p>
<p class=MsoNormal>+
}<o:p></o:p></p>
<p class=MsoNormal> <o:p></o:p></p>
<p class=MsoNormal>-
return cl_timer_start( p_timer, time_ms );<o:p></o:p></p>
<p class=MsoNormal>+
status = cl_timer_start( p_timer, time_ms );<o:p></o:p></p>
<p class=MsoNormal>+
<o:p></o:p></p>
<p class=MsoNormal>+
cl_spinlock_release(&p_timer->spinlock);<o:p></o:p></p>
<p class=MsoNormal>+
return status;<o:p></o:p></p>
<p class=MsoNormal> }<o:p></o:p></p>
<p class=MsoNormal> <o:p></o:p></p>
<p class=MsoNormal>-<o:p></o:p></p>
<p class=MsoNormal> void<o:p></o:p></p>
<p class=MsoNormal> cl_timer_stop(<o:p></o:p></p>
<p class=MsoNormal>
IN cl_timer_t*
const p_timer
)<o:p></o:p></p>
<p class=MsoNormal> {<o:p></o:p></p>
<p class=MsoNormal>
CL_ASSERT( p_timer );<o:p></o:p></p>
<p class=MsoNormal>+
cl_spinlock_acquire(&p_timer->spinlock);<o:p></o:p></p>
<p class=MsoNormal> <o:p></o:p></p>
<p class=MsoNormal>
if( p_timer->h_timer && p_timer->thread_id != GetCurrentThreadId()
)<o:p></o:p></p>
<p class=MsoNormal>
{<o:p></o:p></p>
<p class=MsoNormal>@@ -145,6 +161,8 @@<o:p></o:p></p>
<p class=MsoNormal>
p_timer->h_timer = NULL;<o:p></o:p></p>
<p class=MsoNormal>
}<o:p></o:p></p>
<p class=MsoNormal>
p_timer->timeout_time = 0;<o:p></o:p></p>
<p class=MsoNormal>+
<o:p></o:p></p>
<p class=MsoNormal>+
cl_spinlock_release(&p_timer->spinlock);<o:p></o:p></p>
<p class=MsoNormal> }<o:p></o:p></p>
<p class=MsoNormal> <o:p></o:p></p>
<p class=MsoNormal> <o:p></o:p></p>
<p class=MsoNormal>Index:
B:/users/tzachid/MLNX_WinOF-2_1_1/MLNX_WinOF-2_1_1/inc/complib/cl_timer.h<o:p></o:p></p>
<p class=MsoNormal>===================================================================<o:p></o:p></p>
<p class=MsoNormal>---
B:/users/tzachid/MLNX_WinOF-2_1_1/MLNX_WinOF-2_1_1/inc/complib/cl_timer.h
(revision 5977)<o:p></o:p></p>
<p class=MsoNormal>+++
B:/users/tzachid/MLNX_WinOF-2_1_1/MLNX_WinOF-2_1_1/inc/complib/cl_timer.h
(revision 5978)<o:p></o:p></p>
<p class=MsoNormal>@@ -45,8 +45,8 @@<o:p></o:p></p>
<p class=MsoNormal> <o:p></o:p></p>
<p class=MsoNormal> <o:p></o:p></p>
<p class=MsoNormal> #include <complib/cl_types.h><o:p></o:p></p>
<p class=MsoNormal>+#include <complib/cl_spinlock.h><o:p></o:p></p>
<p class=MsoNormal> <o:p></o:p></p>
<p class=MsoNormal>-<o:p></o:p></p>
<p class=MsoNormal> /****h* Component Library/Timer<o:p></o:p></p>
<p class=MsoNormal> * NAME<o:p></o:p></p>
<p class=MsoNormal> *
Timer<o:p></o:p></p>
<p class=MsoNormal>Index:
B:/users/tzachid/MLNX_WinOF-2_1_1/MLNX_WinOF-2_1_1/inc/user/complib/cl_timer_osd.h<o:p></o:p></p>
<p class=MsoNormal>===================================================================<o:p></o:p></p>
<p class=MsoNormal>---
B:/users/tzachid/MLNX_WinOF-2_1_1/MLNX_WinOF-2_1_1/inc/user/complib/cl_timer_osd.h
(revision 5977)<o:p></o:p></p>
<p class=MsoNormal>+++
B:/users/tzachid/MLNX_WinOF-2_1_1/MLNX_WinOF-2_1_1/inc/user/complib/cl_timer_osd.h
(revision 5978)<o:p></o:p></p>
<p class=MsoNormal>@@ -47,8 +47,9 @@<o:p></o:p></p>
<p class=MsoNormal>
const
void
*context;<o:p></o:p></p>
<p class=MsoNormal>
uint64_t
timeout_time;<o:p></o:p></p>
<p class=MsoNormal>
DWORD
thread_id;<o:p></o:p></p>
<p class=MsoNormal>+
cl_spinlock_t
spinlock;<o:p></o:p></p>
<p class=MsoNormal> <o:p></o:p></p>
<p class=MsoNormal> } cl_timer_t;<o:p></o:p></p>
<p class=MsoNormal> <o:p></o:p></p>
<p class=MsoNormal> <o:p></o:p></p>
<p class=MsoNormal>-#endif // _CL_TIMER_OSD_H_<o:p></o:p></p>
<p class=MsoNormal>\ No newline at end of file<o:p></o:p></p>
<p class=MsoNormal>+#endif
// _CL_TIMER_OSD_H_<o:p></o:p></p>
</div>
</div>
</div>
</body>
</html>