I found the error in our machine. We had an intermittent connection in one node's HCA card. I just happened to look at that node at a moment when the HCA was missing from both `lspci` and `/proc`. I reset the card on its bus and kaboom... success. Thanks everyone for all your help.
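<br><br>For anyone chasing a similar intermittent fault, here is a minimal sketch of the kind of check that would have caught it sooner. It assumes the kernel exposes registered HCAs under `/sys/class/infiniband` (adjust the path if your setup differs). The function name `list_ib_devices` is just something made up for illustration, not part of any OFED tool.

```python
import os


def list_ib_devices(sysfs_root="/sys/class/infiniband"):
    """Return the sorted names of IB devices found under sysfs_root.

    The default path is an assumption about where the kernel registers
    HCAs. A missing directory is treated as "no devices".
    """
    try:
        return sorted(os.listdir(sysfs_root))
    except FileNotFoundError:
        return []


if __name__ == "__main__":
    devices = list_ib_devices()
    if devices:
        print("IB devices visible:", ", ".join(devices))
    else:
        print("No IB devices visible; the HCA may have dropped off the bus")
```

Running this from cron or in a loop on each node would show when a card disappears, which is hard to spot with a one-off `lspci`.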
<br><br><div><span class="gmail_quote">On 8/23/07, <b class="gmail_sendername">John Leidel</b> <<a href="mailto:john.leidel@gmail.com">john.leidel@gmail.com</a>> wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
What's especially odd is that I can get a full-bandwidth ping-pong test running fine [970MB/s++], then rerun the test and have it fail, saying it can't find the IB HCA. <br><br><br><div><span class="q"><span class="gmail_quote">
On 8/23/07,
<b class="gmail_sendername">Tziporet Koren</b> <<a href="mailto:tziporet@mellanox.co.il" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">tziporet@mellanox.co.il</a>> wrote:</span></span><div>
<span class="e" id="q_114936a3ec06c81f_2"><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
<div>
<div><font face="Arial"><font size="2">John Leidel wrote:<span> </span></font></font></div>
<div><span><font face="Arial" size="2"></font></span> </div>
<div><span><font face="Arial"><font size="2"><span>> </span>Unfortunately, the RDMA module load
didn't help... a simple "hello_world" application still returns
:: <br><span>> </span><br><span>> </span>libibverbs: Fatal: no infiniband
class devices found.<br><span>> </span>No IB
device found<br><span>> </span><br><span>> </span>I went and verified that all the
nodes see the HCAs... an lspci on all nodes reports :: <br><span>> </span><br><span>> </span>07:00.0 InfiniBand: Mellanox
Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) (rev
a0)<br><span>> </span>
Subsystem: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility
mode)<br></font></font><br></span>Can you run:<br>/etc/init.d/openibd restart <br>and
send the /var/log/messages <br><br>Thanks<br>Tziporet</div></div>
</blockquote></span></div></div><br>
</blockquote></div><br>