<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=us-ascii">
<meta name=Generator content="Microsoft Word 12 (filtered medium)">
<style>
<!--
/* Font Definitions */
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri","sans-serif";}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:purple;
text-decoration:underline;}
pre
{mso-style-priority:99;
mso-style-link:"HTML Preformatted Char";
margin:0in;
margin-bottom:.0001pt;
font-size:10.0pt;
font-family:"Courier New";}
span.EmailStyle17
{mso-style-type:personal-compose;
font-family:"Calibri","sans-serif";
color:windowtext;}
span.HTMLPreformattedChar
{mso-style-name:"HTML Preformatted Char";
mso-style-priority:99;
mso-style-link:"HTML Preformatted";
font-family:"Courier New";}
.MsoChpDefault
{mso-style-type:export-only;}
@page Section1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.Section1
{page:Section1;}
-->
</style>
</head>
<body lang=EN-US link=blue vlink=purple>
<div class=Section1>
<p class=MsoNormal>Using OFED 1.5.0 and 1.5.1, we’ve been seeing nodes
occasionally hang when a process tries to disconnect from the umad interface.
Can anyone suggest what might be causing this?<o:p></o:p></p>
<p class=MsoNormal><o:p> </o:p></p>
<p class=MsoNormal>Here’s a typical example:<o:p></o:p></p>
<p class=MsoNormal><o:p> </o:p></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Courier New"'>Apr
29 10:01:37 st2139 kernel: qlgc_dsc D ffffffff80148c54
0 5478 <o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Courier New"'> 1
5497 5477 (NOTLB)<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Courier New"'>Apr
29 10:01:37 st2139 kernel: ffff81042b785dd8 0000000000000082
000000000062f388 00000000437b2038<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Courier New"'>Apr
29 10:01:37 st2139 kernel: 0000000000000000 000000000000000a
ffff81043fa3f040 ffff81043fb6e100<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Courier New"'>Apr
29 10:01:37 st2139 kernel: 00003463ec0fbcd0 0000000000003720
ffff81043fa3f228 000000080062f388<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Courier New"'>Apr
29 10:01:37 st2139 kernel: Call Trace:<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Courier New"'>Apr
29 10:01:37 st2139 kernel: [&lt;ffffffff8003dd13&gt;]
do_futex+0x282/0xc3f<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Courier New"'>Apr
29 10:01:37 st2139 kernel: [&lt;ffffffff80063206&gt;]
wait_for_completion+0x79/0xa2<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Courier New"'>Apr
29 10:01:37 st2139 kernel: [&lt;ffffffff8008a461&gt;]
default_wake_function+0x0/0xe<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Courier New"'>Apr
29 10:01:37 st2139 kernel:
[&lt;ffffffff88318399&gt;]:ib_mad:ib_cancel_rmpp_recvs+0xa6/0xe9<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Courier New"'>Apr
29 10:01:37 st2139 kernel:
[&lt;ffffffff883155f1&gt;]:ib_mad:ib_unregister_mad_agent+0x30d/0x424<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Courier New"'>Apr
29 10:01:37 st2139 kernel:
[&lt;ffffffff8850d24e&gt;]:ib_umad:ib_umad_unreg_agent+0x6f/0x94<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Courier New"'>Apr
29 10:01:37 st2139 kernel:
[&lt;ffffffff8850db71&gt;]:ib_umad:ib_umad_ioctl+0x4a/0x5d<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Courier New"'>Apr
29 10:01:37 st2139 kernel: [&lt;ffffffff80041b2e&gt;] do_ioctl+0x21/0x6b<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Courier New"'>Apr
29 10:01:37 st2139 kernel: [&lt;ffffffff8002fd1e&gt;]
vfs_ioctl+0x248/0x261<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Courier New"'>Apr
29 10:01:37 st2139 kernel: [&lt;ffffffff8004c0a3&gt;] sys_ioctl+0x59/0x78<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Courier New"'>Apr
29 10:01:37 st2139 kernel: [&lt;ffffffff8005d28d&gt;] tracesys+0xd5/0xe0<o:p></o:p></span></p>
<p class=MsoNormal><o:p> </o:p></p>
<p class=MsoNormal>Reviewing the code, the problem appears to be that
ib_cancel_rmpp_recvs is blocked in wait_for_completion(), but the matching
complete() is never called, presumably because the reference count on one of
the rmpp structures is wrong and never drops to zero:<o:p></o:p></p>
<p class=MsoNormal><o:p> </o:p></p>
<pre>static inline void deref_rmpp_recv(struct mad_rmpp_recv *rmpp_recv)<o:p></o:p></pre>
<pre>{<o:p></o:p></pre>
<pre>        if (atomic_dec_and_test(&amp;rmpp_recv-&gt;refcount))<o:p></o:p></pre>
<pre>                complete(&amp;rmpp_recv-&gt;comp);<o:p></o:p></pre>
<pre>}<o:p></o:p></pre>
<pre><o:p>&nbsp;</o:p></pre>
<pre>static void destroy_rmpp_recv(struct mad_rmpp_recv *rmpp_recv)<o:p></o:p></pre>
<pre>{<o:p></o:p></pre>
<pre>        deref_rmpp_recv(rmpp_recv);<o:p></o:p></pre>
<pre>        wait_for_completion(&amp;rmpp_recv-&gt;comp);<o:p></o:p></pre>
<pre>        ib_destroy_ah(rmpp_recv-&gt;ah);<o:p></o:p></pre>
<pre>        kfree(rmpp_recv);<o:p></o:p></pre>
<pre>}<o:p></o:p></pre>
<p class=MsoNormal><o:p> </o:p></p>
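<p class=MsoNormal>To make the suspected failure mode concrete, here is a
minimal user-space sketch of the same reference-count-plus-completion pattern.
It is a model, not the ib_mad code: the model_* names and the scenario() helper
are hypothetical, the completion is reduced to a flag, and the initial
reference count of 1 is an assumption about how the structure is set up. It
only shows that if one reference is taken and never dropped, the deref in the
destroy path cannot bring the count to zero, so complete() would never run and
wait_for_completion() would block indefinitely.<o:p></o:p></p>
<p class=MsoNormal><o:p>&nbsp;</o:p></p>

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* User-space stand-in for the kernel's struct completion.
 * Assumption: only the "has complete() been called" state matters here. */
struct model_completion {
	bool done;
};

/* Hypothetical model of the two fields of mad_rmpp_recv used above. */
struct model_rmpp_recv {
	atomic_int refcount;
	struct model_completion comp;
};

static void model_init(struct model_rmpp_recv *r)
{
	/* Assumption: the object starts with one reference, which the
	 * destroy path is expected to drop. */
	atomic_init(&r->refcount, 1);
	r->comp.done = false;
}

/* Take an extra reference, e.g. while another code path uses the object. */
static void model_ref(struct model_rmpp_recv *r)
{
	atomic_fetch_add(&r->refcount, 1);
}

/* Mirrors deref_rmpp_recv(): signal completion when the count hits zero. */
static void model_deref(struct model_rmpp_recv *r)
{
	assert(atomic_load(&r->refcount) > 0);
	/* atomic_fetch_sub() returns the old value, so 1 means "now zero". */
	if (atomic_fetch_sub(&r->refcount, 1) == 1)
		r->comp.done = true;
}

/* Mirrors the first two steps of destroy_rmpp_recv(). Instead of blocking,
 * it reports whether wait_for_completion() would block with nobody left
 * to call complete(). */
static bool model_destroy_would_hang(struct model_rmpp_recv *r)
{
	model_deref(r);
	return !r->comp.done;
}

/* Run one scenario: a second reference is taken and either released
 * (balanced) or leaked before the destroy path runs. */
static bool scenario(bool leak_reference)
{
	struct model_rmpp_recv r;

	model_init(&r);
	model_ref(&r);
	if (!leak_reference)
		model_deref(&r);
	return model_destroy_would_hang(&r);
}
```

<p class=MsoNormal><o:p>&nbsp;</o:p></p>
<p class=MsoNormal>In this model, scenario(false) reports no hang, because the
destroy path drops the last reference, while scenario(true) reports a hang,
because the count is still 1 after the destroy path's deref. Whether the real
hang comes from a leaked reference or from some other imbalance is not
something the sketch can establish.<o:p></o:p></p>
<p class=MsoNormal><o:p>&nbsp;</o:p></p>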
<p class=MsoNormal>Searching our internal bug database, I found that this
problem has been around for several years, but we were never able to
reproduce it under controlled circumstances. Most frequently, the problem
occurred when trying to unload a module. Here’s an example that was
captured in 2007:<o:p></o:p></p>
<p class=MsoNormal><o:p> </o:p></p>
<pre>rmmod D ffff81003af6fd60 0 22020 21962<o:p></o:p></pre>
<pre> ffff81003b017c68 0000000000000082 ffffffff813a22a8 ffff81003b017c88<o:p></o:p></pre>
<pre> ffff81003b017c90 ffff81003ab39800 ffff81003fba6800 ffff81003ab39a68<o:p></o:p></pre>
<pre> 000000013b017c58 ffffffff8126b945 0000000000000001 ffffffff81042433<o:p></o:p></pre>
<pre>Call Trace:<o:p></o:p></pre>
<pre> [&lt;ffffffff8126b945&gt;] wait_for_completion+0xa0/0xb3<o:p></o:p></pre>
<pre> [&lt;ffffffff81042433&gt;] flush_cpu_workqueue+0x29/0x6f<o:p></o:p></pre>
<pre> [&lt;ffffffff8102def5&gt;] default_wake_function+0x0/0xe<o:p></o:p></pre>
<pre> [&lt;ffffffff8126b92f&gt;] wait_for_completion+0x8a/0xb3<o:p></o:p></pre>
<pre> [&lt;ffffffff8102def5&gt;] default_wake_function+0x0/0xe<o:p></o:p></pre>
<pre> [&lt;ffffffff881271d7&gt;] :ib_mad:ib_cancel_rmpp_recvs+0x8a/0xdf<o:p></o:p></pre>
<pre> [&lt;ffffffff88124475&gt;] :ib_mad:ib_unregister_mad_agent+0x333/0x445<o:p></o:p></pre>
<pre> [&lt;ffffffff8812f0d0&gt;] :ib_sa:free_sm_ah+0x0/0x17<o:p></o:p></pre>
<pre> [&lt;ffffffff88125e90&gt;] :ib_mad:ib_agent_port_close+0x7c/0x8b<o:p></o:p></pre>
<pre> [&lt;ffffffff8812245b&gt;] :ib_mad:ib_mad_remove_device+0x38/0x85<o:p></o:p></pre>
<pre> [&lt;ffffffff880fbf20&gt;] :ib_core:ib_unregister_device+0x30/0xc4<o:p></o:p></pre>
<pre> [&lt;ffffffff8817033c&gt;] :ib_ipath:ipath_unregister_ib_device+0x59/0x282<o:p></o:p></pre>
<pre> [&lt;ffffffff88152e69&gt;] :ib_ipath:ipath_remove_one+0x75/0x474<o:p></o:p></pre>
<pre> [&lt;ffffffff81122d01&gt;] pci_device_remove+0x24/0x48<o:p></o:p></pre>
<pre> [&lt;ffffffff811885aa&gt;] __device_release_driver+0x8e/0xb0<o:p></o:p></pre>
<pre> [&lt;ffffffff81188ae8&gt;] driver_detach+0xce/0x10e<o:p></o:p></pre>
<pre> [&lt;ffffffff81188053&gt;] bus_remove_driver+0x6d/0x90<o:p></o:p></pre>
<pre> [&lt;ffffffff81122f53&gt;] pci_unregister_driver+0x10/0x5f<o:p></o:p></pre>
<pre> [&lt;ffffffff8817da5f&gt;] :ib_ipath:infinipath_cleanup+0x3f/0x4c<o:p></o:p></pre>
<pre> [&lt;ffffffff81050d23&gt;] sys_delete_module+0x196/0x1c5<o:p></o:p></pre>
<p class=MsoNormal><o:p> </o:p></p>
</div>
</body>
</html>