<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
</head>
<body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; color: rgb(0, 0, 0); font-size: 14px; font-family: Calibri, sans-serif; ">
<div><br>
</div>
<div>We have been running our HPC cluster on RHEL 6.5 with OFED 2.4 for several months. We have suddenly started seeing errors like the following on our compute nodes:</div>
<div><br>
</div>
<div>
<div>Mar 29 09:29:17 cnode301 kernel: mlx4_core 0000:87:00.0: CQ overrun on CQN 00009c</div>
<div>Mar 29 09:30:54 cnode301 kernel: ib0: timing out; 1 sends not completed still waiting..</div>
<div>Mar 29 09:30:59 cnode301 kernel: ib0: timing out; 1 sends not completed still waiting..</div>
<div>Mar 29 09:31:04 cnode301 kernel: ib0: timing out; 1 sends not completed still waiting..</div>
<div>Mar 29 09:31:09 cnode301 kernel: ib0: timing out; 1 sends not completed still waiting..</div>
<div>Mar 29 09:31:14 cnode301 kernel: ib0: timing out; 1 sends not completed still waiting..</div>
<div>Mar 29 09:31:14 cnode301 kernel: ib0: ipoib_cm_tx_destroy: 1 not completed force cleanup.</div>
</div>
<div><br>
</div>
<div>Once this happens, the compute node has to be rebooted.</div>
<div><br>
</div>
<div>The user is not running MPI code, though presumably the job is doing I/O to the cluster filesystem, which is mounted over the InfiniBand network. This particular application seems to provoke the error more often than anything else, so we assume
it is hitting a bug somewhere in our cluster configuration (firmware, filesystem, OS, or OFED stack).</div>
<div><br>
</div>
<div>Is an error like this the result of a driver issue, a firmware issue, or something else?</div>
<div><br>
</div>
<div>Any suggestions on where to look to find the problem would be appreciated.</div>
<div><br>
</div>
<div>Thanks so much!</div>
<div>-Roger</div>
<div>
<div><br>
</div>
<div><span style="color: rgb(31, 73, 125); font-size: 11pt; ">===============================</span></div>
<div>
<p class="MsoNormal" style="margin: 0in 0in 0.0001pt; font-size: 11pt; "><span style="color: rgb(31, 73, 125); ">Roger V. Moye<o:p></o:p></span></p>
<p class="MsoNormal" style="margin: 0in 0in 0.0001pt; font-size: 11pt; "><span style="color: rgb(31, 73, 125); ">UNIX Systems Administrator<o:p></o:p></span></p>
<p class="MsoNormal" style="margin: 0in 0in 0.0001pt; font-size: 11pt; "><span style="color: rgb(31, 73, 125); ">XSEDE Campus Champion<o:p></o:p></span></p>
<p class="MsoNormal" style="margin: 0in 0in 0.0001pt; font-size: 11pt; "><span style="color: rgb(31, 73, 125); ">University of Texas - MD Anderson Cancer Center<o:p></o:p></span></p>
<p class="MsoNormal" style="margin: 0in 0in 0.0001pt; font-size: 11pt; "><span style="color: rgb(31, 73, 125); ">Research Information Systems and Technology Services</span></p>
<p class="MsoNormal" style="margin: 0in 0in 0.0001pt; font-size: 11pt; "><span style="color: rgb(31, 73, 125); ">Houston, Texas<o:p></o:p></span></p>
<p class="MsoNormal" style="margin: 0in 0in 0.0001pt; font-size: 11pt; "><span style="color: rgb(31, 73, 125); ">1MC 13.2430</span></p>
<p class="MsoNormal" style="margin: 0in 0in 0.0001pt; font-size: 11pt; "><span style="color: rgb(31, 73, 125); ">(713) 792-2134</span></p>
<p class="MsoNormal" style="margin: 0in 0in 0.0001pt; font-size: 11pt; "><span style="color: rgb(31, 73, 125); ">===============================</span></p>
</div>
</div>
</body>
</html>