Hi All,<br> I will be working on this as time permits this week. Unfortunately, my employer is not crazy about giving out remote access, so I will have to be your hands on this. If you want me to do something, just tell me what it is. I know it's a pain; I have been there myself.
<br><br>Regards.<br><br><div><span class="gmail_quote">On 5/30/06, <b class="gmail_sendername"><a href="mailto:Don.Albert@bull.com">Don.Albert@bull.com</a></b> &lt;<a href="mailto:Don.Albert@bull.com">Don.Albert@bull.com</a>&gt; wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div>
<br><font face="sans-serif" size="2">Hal,</font>
<br>
<br><font size="2"><tt>With your patch to OpenSM, I think everything is ok
on the local node. The remote node is definitely having some problems,
resulting in not responding to the MAD packets. I have entered a
separate message on the problems with the "ib0" interface on
that machine.<br>
</tt></font></div><div><span class="q">
<br><font size="2"><tt>> <br>
> On Fri, 2006-05-26 at 20:59, Hal Rosenstock wrote:<br>
> > > What next, coach?<br>
> > <br>
> > Can you turn on madeye on the remote node and see what packets
are<br>
> > received and sent ? Let me know if you need help with that. I
think you<br>
> > said you were running OFED, right ?<br>
> </tt></font>
<br>
<br></span></div><div><font size="2"><tt>Yes, I am running kernel 2.6.16 with the OFED RC5
release. I will investigate how to run madeye, but the hangs on the
remote machine are probably the root cause of the link failure.</tt></font>
</div><div><span class="q"><br>
<br><font size="2"><tt>> I don't think madeye is part of OFED :-( Can
it get added for RC6,<br>
> Tziporet ? I think it would be a useful tool to add for problems like<br>
> this.<br>
> <br>
> Also, was this a working setup before ? Did anything else change besides<br>
> installing RC5 on both nodes ?<br>
> </tt></font>
<br>
<br></span></div><div><font size="2"><tt>This back-to-back setup was originally working with
a backported 2.6.11-34 kernel and, I believe, revision 6500 of the
OpenIB svn trunk at that time. The problems started when I tried
to move to RC4, and now RC5, of the OFED release with the 2.6.16 kernel.</tt></font>
</div><div><span class="q"><br><font size="2"><tt><br>
> I have two more experiments I'd like you to try, before we go down
the<br>
> madeye "route":<br>
> <br>
> 1. Do you have another IB cable to try ?<br>
> <br>
> 2. Can you completely shutdown and repower the remote node and see
if it<br>
> starts responding ?<br>
> </tt></font>
<br>
<br></span></div><div><font size="2"><tt>It is difficult for me to debug this sort of thing,
since I telecommute from Tucson and the machines are located in Phoenix.
But I can get someone there to power the machine down and reboot.</tt></font>
<br><font size="2"><tt><br>
-Don Albert-</tt></font>
<br>
<br>
</div><br>_______________________________________________<br>openib-general mailing list<br><a href="mailto:openib-general@openib.org">openib-general@openib.org</a><br>
<a href="http://openib.org/mailman/listinfo/openib-general" target="_blank">http://openib.org/mailman/listinfo/openib-general</a><br><br></blockquote></div><br>