<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=us-ascii">
<META content="MSHTML 6.00.6000.16809" name=GENERATOR></HEAD>
<BODY>
<DIV dir=ltr align=left><FONT face=Arial color=#0000ff size=2><SPAN
class=104035415-28042009>Hi Leonid,</SPAN></FONT></DIV>
<DIV dir=ltr align=left><FONT face=Arial color=#0000ff size=2><SPAN
class=104035415-28042009> I think you want to move the cl_free() after the
deref_al_obj() call.</SPAN></FONT></DIV>
<DIV dir=ltr align=left><FONT face=Arial color=#0000ff size=2><SPAN
class=104035415-28042009></SPAN></FONT> </DIV>
<DIV dir=ltr align=left><FONT face=Arial color=#0000ff size=2><SPAN
class=104035415-28042009>stan.</SPAN></FONT></DIV>
<DIV dir=ltr align=left><FONT face=Arial color=#0000ff
size=2></FONT> </DIV>
<DIV dir=ltr align=left><FONT face=Arial color=#0000ff size=2>@@ -2354,6 +2357,8
@@<BR> if( !cl_atomic_dec( &gp_ioc_pnp->query_cnt )
)<BR> cl_async_proc_queue( gp_async_pnp_mgr,
&gp_ioc_pnp->async_item );<BR> <FONT
color=#ff0000>cl_free( p_results );</FONT><BR>+ /* Release the
reference taken for the query. */<BR>+ deref_al_obj(
&p_results->p_svc->obj
);<BR> }<BR> <BR> AL_EXIT( AL_DBG_PNP
);<BR></FONT></DIV><BR>
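<DIV dir=ltr align=left><FONT face=Arial color=#0000ff size=2>The ordering point can be sketched in isolation. In the hunk above, p_results is freed and then dereferenced to reach p_svc, so the release has to come first. The types and helpers below (demo_svc_t, demo_results_t, demo_deref, demo_finish, demo_run) are hypothetical stand-ins, not the real AL definitions, and free() stands in for cl_free().</FONT></DIV>

```c
#include <stdlib.h>

/* Hypothetical stand-ins for the AL object and the sweep results. */
typedef struct demo_svc { int ref_cnt; } demo_svc_t;
typedef struct demo_results { demo_svc_t *p_svc; } demo_results_t;

/* Stand-in for deref_al_obj(): drop one reference. */
static void demo_deref(demo_svc_t *p_svc) { p_svc->ref_cnt--; }

/* Drop the service reference while p_results is still valid, then free it.
   After free(), p_results must not be touched again. */
static void demo_finish(demo_results_t *p_results)
{
    demo_deref(p_results->p_svc);
    free(p_results);
}

/* Returns the reference count after one completed sweep,
   or -1 if the allocation fails. */
static int demo_run(void)
{
    demo_svc_t svc = { 1 };  /* one reference held for the sweep */
    demo_results_t *p_results = malloc(sizeof *p_results);
    if (p_results == NULL)
        return -1;
    p_results->p_svc = &svc;
    demo_finish(p_results);
    return svc.ref_cnt;
}
```

<DIV dir=ltr align=left><FONT face=Arial color=#0000ff size=2>Swapping the two statements in demo_finish reproduces the problem: the pointer to the service would be read out of memory that has already been freed.</FONT></DIV>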
<DIV class=OutlookMessageHeader lang=en-us dir=ltr align=left>
<HR tabIndex=-1>
<FONT face=Tahoma size=2><B>From:</B> Leonid Keller
[mailto:leonid@mellanox.co.il] <BR><B>Sent:</B> Monday, April 27, 2009 5:38
AM<BR><B>To:</B> Leonid Keller; Fab Tillier; Smith, Stan<BR><B>Cc:</B>
ofw@lists.openfabrics.org<BR><B>Subject:</B> RE: [ofw] crash on IBBUS disabling
while mad traffic<BR></FONT><BR></DIV>
<DIV></DIV>
<DIV><SPAN class=818171912-27042009><FONT face=Arial color=#0000ff size=2>Here
is a possible explanation and a fix. Please review.</FONT></SPAN></DIV>
<DIV><SPAN class=818171912-27042009><FONT face=Arial color=#0000ff
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=818171912-27042009><FONT face=Arial color=#0000ff
size=2>__ioc_query_sa takes references on the IOC PnP service before sending the
node record and path record requests.</FONT></SPAN></DIV>
<DIV><SPAN class=818171912-27042009><FONT face=Arial size=2><FONT
color=#0000ff>But these </FONT><FONT color=#0000ff>references are released at
the end of __node_rec_cb and __path_rec_cb, while the __process_sweep routine,
which performs the IOU sweep, has only been scheduled to run in an async
thread at that point.</FONT></FONT></SPAN></DIV>
<DIV><SPAN class=818171912-27042009><FONT face=Arial color=#0000ff size=2>If the
test happens to unload the driver after __node_rec_cb and __path_rec_cb have run
but before __process_sweep starts, the IOC PnP service is released and
__process_sweep crashes.</FONT></SPAN></DIV>
<DIV><SPAN class=818171912-27042009><FONT face=Arial color=#0000ff
size=2></FONT></SPAN> </DIV>
<DIV><FONT face=Arial color=#0000ff size=2><SPAN class=818171912-27042009>The
patch takes a reference on the IOC PnP service before scheduling a thread for
__process_sweep and releases the reference at the end of
__process_sweep.</SPAN></FONT></DIV>
<DIV><FONT face=Arial color=#0000ff size=2><SPAN class=818171912-27042009>(Note
that __process_sweep schedules a thread for
itself twice while moving through its FSM: </SPAN></FONT></DIV>
<DIV><FONT face=Arial color=#0000ff size=2><SPAN
class=818171912-27042009>SWEEP_IOU_INFO --> SWEEP_IOC_PROFILE -->
SWEEP_SVC_ENTRIES --> SWEEP_COMPLETE)</SPAN></FONT></DIV>
<DIV><FONT face=Arial color=#0000ff size=2></FONT> </DIV>
<DIV><FONT face=Arial color=#0000ff size=2>Index:
al/kernel/al_ioc_pnp.c<BR>===================================================================<BR>---
al/kernel/al_ioc_pnp.c (revision 3609)<BR>+++
al/kernel/al_ioc_pnp.c (working copy)<BR>@@ -2231,8 +2231,11
@@<BR> * If this is the last MAD, finish processing the IOU
queries<BR> * in the PnP thread.<BR> */<BR>- if(
!cl_atomic_dec( &p_results->p_svc->query_cnt ) )<BR>+ if(
!cl_atomic_dec( &p_results->p_svc->query_cnt ) ) {<BR>+ /*
Reference the service till the end of processing in the thread
*/<BR>+ ref_al_obj( &p_results->p_svc->obj
);<BR> cl_async_proc_queue( gp_async_pnp_mgr,
&p_results->async_item );<BR>+ }<BR> <BR> AL_EXIT(
AL_DBG_PNP );<BR> }<BR>@@ -2354,6 +2357,8 @@<BR> if(
!cl_atomic_dec( &gp_ioc_pnp->query_cnt )
)<BR> cl_async_proc_queue( gp_async_pnp_mgr,
&gp_ioc_pnp->async_item );<BR> cl_free( p_results
);<BR>+ /* Release the reference taken for the query.
*/<BR>+ deref_al_obj( &p_results->p_svc->obj
);<BR> }<BR> <BR> AL_EXIT( AL_DBG_PNP
);<BR></FONT></DIV>
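<DIV><FONT face=Arial color=#0000ff size=2>As a rough illustration of the lifetime rule behind the patch, the sketch below pins an object before queuing a worker and releases it only when the worker reaches its final state. Only the four state names come from the message above. The demo_* types and functions, the single-slot "queue", and the choice to hold one reference across the whole chain are simplifying assumptions. The real placement of ref_al_obj() and deref_al_obj() is the one shown in the diff.</FONT></DIV>

```c
/* State names are taken from the message; everything else is hypothetical. */
typedef enum {
    SWEEP_IOU_INFO,
    SWEEP_IOC_PROFILE,
    SWEEP_SVC_ENTRIES,
    SWEEP_COMPLETE
} sweep_state_t;

typedef struct demo_obj { int ref_cnt; } demo_obj_t;

typedef struct demo_sweep {
    demo_obj_t   *p_svc;      /* object that must outlive the worker */
    sweep_state_t state;
    int           queued;     /* 1 while a work item is pending */
    int           queue_cnt;  /* how many times the worker was queued */
} demo_sweep_t;

static void demo_ref(demo_obj_t *p)   { p->ref_cnt++; }
static void demo_deref(demo_obj_t *p) { p->ref_cnt--; }

/* Stand-in for queuing an async item: just mark it pending. */
static void demo_queue(demo_sweep_t *p) { p->queued = 1; p->queue_cnt++; }

/* Last query callback: pin the service first, then hand off to the worker. */
static void demo_last_query_done(demo_sweep_t *p)
{
    demo_ref(p->p_svc);
    demo_queue(p);
}

/* Worker: advance one state per run and requeue itself until the sweep
   is complete; the reference is dropped only on the final pass. */
static void demo_process_sweep(demo_sweep_t *p)
{
    p->queued = 0;
    switch (p->state) {
    case SWEEP_IOU_INFO:
        p->state = SWEEP_IOC_PROFILE;
        demo_queue(p);
        break;
    case SWEEP_IOC_PROFILE:
        p->state = SWEEP_SVC_ENTRIES;
        demo_queue(p);
        break;
    case SWEEP_SVC_ENTRIES:
        p->state = SWEEP_COMPLETE;
        demo_deref(p->p_svc);
        break;
    case SWEEP_COMPLETE:
        break;
    }
}

/* Drive the sweep to completion; returns the final reference count. */
static int demo_run(demo_sweep_t *p)
{
    demo_last_query_done(p);
    while (p->queued)
        demo_process_sweep(p);
    return p->p_svc->ref_cnt;
}
```

<DIV><FONT face=Arial color=#0000ff size=2>In this sketch the worker is queued once by the callback and requeues itself twice, and the service's reference count never drops below its starting value while any pass is still pending.</FONT></DIV>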
<DIV><FONT face=Arial color=#0000ff size=2></FONT> </DIV>
<DIV><BR></DIV>
<BLOCKQUOTE dir=ltr
style="PADDING-LEFT: 5px; MARGIN-LEFT: 5px; BORDER-LEFT: #0000ff 2px solid; MARGIN-RIGHT: 0px">
<DIV class=OutlookMessageHeader lang=en-us dir=ltr align=left>
<HR tabIndex=-1>
<FONT face=Tahoma size=2><B>From:</B> Leonid Keller <BR><B>Sent:</B> Sunday,
April 26, 2009 1:05 AM<BR><B>To:</B> 'Fab Tillier'; 'Smith,
Stan'<BR><B>Cc:</B> ofw@lists.openfabrics.org<BR><B>Subject:</B> [ofw] crash
on IBBUS disabling while mad traffic<BR></FONT><BR></DIV>
<DIV></DIV>
<DIV><FONT face=Arial size=2><SPAN class=654294220-25042009>I got a crash
while running the WHQL Disable Enable test while opensm was running on another
node.</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN class=654294220-25042009>I was
running a December version of the driver, and I'm not sure whether the crash
reproduces with the current one (I'll try).</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN
class=654294220-25042009></SPAN></FONT> </DIV>
<DIV><FONT face=Arial size=2><SPAN class=654294220-25042009>The test, which
disables and re-enables all devices, passes without
opensm.</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN class=654294220-25042009>With opensm running, IBBUS
sends SA requests to it.</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN class=654294220-25042009>In this case
</SPAN></FONT><FONT face=Arial><FONT size=2>__process_sweep<SPAN
class=654294220-25042009>() fails, because the per-port IOC PnP agent seems to
have been released already.</SPAN></FONT></FONT></DIV>
<DIV><FONT face=Arial><FONT size=2><SPAN class=654294220-25042009>This
is strange, because __ioc_query_sa takes a reference on the PnP agent before sending
the request.</SPAN></FONT></FONT></DIV>
<DIV><FONT face=Arial><FONT size=2><SPAN
class=654294220-25042009> __ioc_query_sa<BR> __node_rec_cb<BR> __process_query<BR> __process_sweep<BR></SPAN></FONT></FONT></DIV>
<DIV><FONT face=Arial><FONT size=2><SPAN class=654294220-25042009>Any
ideas?</SPAN></FONT></FONT></DIV>
<DIV><FONT face=Arial><FONT size=2><SPAN
class=654294220-25042009></SPAN></FONT></FONT> </DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>3: kd> !analyze -v<BR>ERROR: FindPlugIns
8007007b<BR>*******************************************************************************<BR>*
*<BR>*
Bugcheck
Analysis
*<BR>*
*<BR>*******************************************************************************</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2>DRIVER_PAGE_FAULT_IN_FREED_SPECIAL_POOL
(d5)<BR>Memory was referenced after it was freed.<BR>This cannot be protected
by try-except.<BR>When possible, the guilty driver's name (Unicode string) is
printed on<BR>the bugcheck screen and saved in
KiBugCheckDriver.<BR>Arguments:<BR>Arg1: fffff98005b72f84, memory
referenced<BR>Arg2: 0000000000000000, value 0 = read operation, 1 = write
operation<BR>Arg3: fffffa600400b1d0, if non-zero, the address which referenced
memory.<BR>Arg4: 0000000000000000, (reserved)</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2>Debugging
Details:<BR>------------------</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2>Matched: ibbus!proxy_ioctl+0x41
(fffffa60`04031d8d) <BR>Matched: ibbus!proxy_ioctl+0xa5 (fffffa60`04031df1)
</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>READ_ADDRESS: fffff98005b72f84 Special
pool</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2>FAULTING_IP: <BR>ibbus!__process_sweep+44
[s:\builds\3609\branches\mlnx_winof_2-0\core\al\kernel\al_ioc_pnp.c @
2315]<BR>fffffa60`0400b1d0 83b8d400000003 cmp
dword ptr [rax+0D4h],3</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2>MM_INTERNAL_CODE: 0</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2>IMAGE_NAME: ibbus.sys</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2>DEBUG_FLR_IMAGE_TIMESTAMP:
49401b3e</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2>MODULE_NAME: ibbus</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2>FAULTING_MODULE: fffffa6004002000
ibbus</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2>DEFAULT_BUCKET_ID:
VISTA_DRIVER_FAULT</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2>BUGCHECK_STR: 0xD5</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2>PROCESS_NAME: System</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2>CURRENT_IRQL: f</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2>TRAP_FRAME: fffffa6003d50b00 -- (.trap
0xfffffa6003d50b00)<BR>NOTE: The trap frame does not contain all
registers.<BR>Some register values may be zeroed or
incorrect.<BR>rax=fffff98005b72eb0 rbx=0000000000000000
rcx=fffffa6004057780<BR>rdx=fffffa6004005e97 rsi=fffffa600199ccc0
rdi=fffff80001cc0304<BR>rip=fffffa600400b1d0 rsp=fffffa6003d50c90
rbp=0000000000000080<BR> r8=0000000000000005 r9=fffffa6004005e97
r10=0000000000000001<BR>r11=fffffa6003d50c50 r12=0000000000000000
r13=0000000000000000<BR>r14=0000000000000000
r15=0000000000000000<BR>iopl=0
nv up ei pl zr na po nc<BR>ibbus!__process_sweep+0x44:<BR>fffffa60`0400b1d0
83b8d400000003 cmp dword ptr [rax+0D4h],3
ds:fffff980`05b72f84=????????<BR>Resetting default scope</FONT></DIV>
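<DIV><FONT face=Arial size=2>The trap frame can be cross-checked against the bugcheck arguments: the faulting instruction reads a dword at rax+0D4h, and adding 0D4h to the rax value shown above should give Arg1. The small sketch below does that arithmetic. The constant and function names are made up for illustration. Reading rax as p_results->p_svc and 0D4h as the offset of obj.state is an inference from the FAULTING_SOURCE_CODE listing at line 2315, not something the debugger states.</FONT></DIV>

```c
#include <stdint.h>

/* Values copied from the trap frame and the bugcheck arguments. */
static const uint64_t RAX_VALUE    = 0xfffff98005b72eb0ULL; /* rax */
static const uint64_t DISPLACEMENT = 0xD4ULL;               /* [rax+0D4h] */
static const uint64_t ARG1_FAULT   = 0xfffff98005b72f84ULL; /* Arg1 */

/* Effective address of the memory operand in:
   cmp dword ptr [rax+0D4h],3 */
static uint64_t effective_address(uint64_t base, uint64_t disp)
{
    return base + disp;
}
```

<DIV><FONT face=Arial size=2>The sum matches Arg1 and READ_ADDRESS, which is consistent with the pointer in rax referring to an object that had already been returned to special pool.</FONT></DIV>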
<DIV> </DIV>
<DIV><FONT face=Arial size=2>LAST_CONTROL_TRANSFER: from
fffff80001969c42 to fffff800018b0b30</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2>STACK_TEXT: <BR>fffffa60`03d502f8
fffff800`01969c42 : fffffa80`0e0eb290 fffff800`0194893d fffff800`01a55140
00000000`00001000 : nt!RtlpBreakWithStatusInstruction<BR>fffffa60`03d50300
fffff800`0196adb7 : fffff800`00000004 fffff800`01a55140 ffffffff`fffff000
00000000`00000050 : nt!KiBugCheckDebugBreak+0x12<BR>fffffa60`03d50360
fffff800`018b6754 : fffffa80`0dd77480 fffff800`01cc2bb9 00000000`00000000
fffff800`0194c13f : nt!KeBugCheck2+0xaa7<BR>fffffa60`03d509d0
fffff800`018c5671 : 00000000`00000050 fffff980`05b72f84 00000000`00000000
fffffa60`03d50b00 : nt!KeBugCheckEx+0x104<BR>fffffa60`03d50a10
fffff800`018b51d9 : 00000000`00000000 fffff980`0427cf78 fffffa80`0e0ecf00
fffff980`1c27ef40 : nt!MmAccessFault+0x1371<BR>fffffa60`03d50b00
fffffa60`0400b1d0 : fffff980`1c27ef40 fffff980`04318e00 fffffa60`04005eba
fffff980`04318e78 : nt!KiPageFault+0x119<BR>fffffa60`03d50c90
fffffa60`04005e9d : fffff980`04318e98 fffff980`043bccb0 fffff980`1b88afd0
fffff980`04318e78 : ibbus!__process_sweep+0x44
[s:\builds\3609\branches\mlnx_winof_2-0\core\al\kernel\al_ioc_pnp.c @
2315]<BR>fffffa60`03d50cc0 fffffa60`040070d9 : fffff980`04318d60
fffff980`0434afd0 00000000`00000000 fffffa60`0400743c :
ibbus!__cl_async_proc_worker+0x61
[s:\builds\3609\branches\mlnx_winof_2-0\core\complib\cl_async_proc.c @
153]<BR>fffffa60`03d50cf0 fffffa60`04007464 : fffff980`0434afd0
00000000`00000080 fffff980`0434afd0 8b8b8b8b`8b8b8b8b :
ibbus!__cl_thread_pool_routine+0x41
[s:\builds\3609\branches\mlnx_winof_2-0\core\complib\cl_threadpool.c @
66]<BR>fffffa60`03d50d20 fffff800`01adafd3 : 8b8b8b8b`8b8b8b8b
8b8b8b8b`8b8b8b8b 8b8b8b8b`8b8b8b8b 8b8b8b8b`8b8b8b01 :
ibbus!__thread_callback+0x28
[s:\builds\3609\branches\mlnx_winof_2-0\core\complib\kernel\cl_thread.c @
49]<BR>fffffa60`03d50d50 fffff800`018f0816 : fffffa60`01999180
fffffa80`0e0eb290 fffffa60`019a2d40 00000000`00000001 :
nt!PspSystemThreadStartup+0x57<BR>fffffa60`03d50d80 00000000`00000000 :
00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 :
nt!KiStartSystemThread+0x16</FONT></DIV>
<DIV> </DIV><FONT face=Arial size=2>
<DIV><BR>STACK_COMMAND: kb</DIV>
<DIV> </DIV>
<DIV>FOLLOWUP_IP: <BR>ibbus!__process_sweep+44
[s:\builds\3609\branches\mlnx_winof_2-0\core\al\kernel\al_ioc_pnp.c @
2315]<BR>fffffa60`0400b1d0 83b8d400000003 cmp
dword ptr [rax+0D4h],3</DIV>
<DIV> </DIV>
<DIV>FAULTING_SOURCE_CODE: <BR> 2311: <BR> 2312:
p_results = PARENT_STRUCT( p_async_item, ioc_sweep_results_t, async_item
);<BR> 2313: CL_ASSERT( !p_results->p_svc->query_cnt
);<BR> 2314: <BR>> 2315: if( p_results->p_svc->obj.state
== CL_DESTROYING )<BR> 2316: {<BR> 2317:
__put_iou_map( gp_ioc_pnp, &p_results->iou_map );<BR>
2318: goto err;<BR> 2319: }<BR> 2320: </DIV>
<DIV> </DIV>
<DIV><BR>SYMBOL_STACK_INDEX: 6</DIV>
<DIV> </DIV>
<DIV>SYMBOL_NAME: ibbus!__process_sweep+44</DIV>
<DIV> </DIV>
<DIV>FOLLOWUP_NAME: MachineOwner</DIV>
<DIV> </DIV>
<DIV>FAILURE_BUCKET_ID: X64_0xD5_VRF_ibbus!__process_sweep+44</DIV>
<DIV> </DIV>
<DIV>BUCKET_ID: X64_0xD5_VRF_ibbus!__process_sweep+44</DIV>
<DIV> </DIV>
<DIV>Followup: MachineOwner<BR>---------</DIV>
<DIV> </DIV>
<DIV></FONT> </DIV></BLOCKQUOTE></BODY></HTML>