[openfabrics-ewg] Re: [openib-general] OpenSM segmentation fault on RC5

Hal Rosenstock halr at voltaire.com
Tue May 30 10:14:41 PDT 2006


Hi Paul,

On Tue, 2006-05-30 at 11:06, Paul wrote:
> Hi All,
>      I will be working on this as time permits this week.
> Unfortunately my employer is not crazy about giving out remote access,
> so I will have to be your hands on this. If you want me to do
> something just tell me what it is. I know it's a pain; I have been there
> myself.

I should have access to a G5 in a day or so, so let me see if I can
recreate this.

-- Hal

> Regards.
> 
> On 5/30/06, Don.Albert at bull.com <Don.Albert at bull.com > wrote:
>         Hal,
>         
>         With your patch to OpenSM, I think everything is ok on the
>         local node.  The remote node is definitely having some
>         problems, resulting in not responding to the MAD packets.  I
>         have entered a separate message on the problems with the "ib0"
>         interface on that machine.
>         
>         > 
>         > On Fri, 2006-05-26 at 20:59, Hal Rosenstock wrote:
>         > > > What next, coach?
>         > > 
>         > > Can you turn on madeye on the remote node and see what packets are
>         > > received and sent? Let me know if you need help with that. I think you
>         > > said you were running OFED, right?
>         > 
>         
>         
>         Yes, I am running kernel 2.6.16 with the OFED RC5 release.  I
>         will investigate how to run madeye, but the hangs on the
>         remote machine are probably the root cause of the link
>         failure.
>         
>         > I don't think madeye is part of OFED :-( Can it get added for RC6,
>         > Tziporet? I think it would be a useful tool to add for problems like
>         > this.
>         > 
>         > Also, was this a working setup before? Did anything else change besides
>         > installing RC5 on both nodes?
>         > 
>         
>         
>         This back-to-back setup was working originally with a
>         backported 2.6.11-34 kernel and I believe it was revision 6500
>         from the OpenIB svn trunk at that time.  The problems started
>         when I tried to move to RC4 and now RC5 of the OFED release,
>         with the 2.6.16 kernel.
>         
>         > I have two more experiments I'd like you to try, before we go down the
>         > madeye "route":
>         > 
>         > 1. Do you have another IB cable to try?
>         > 
>         > 2. Can you completely shut down and repower the remote node and see if it
>         > starts responding?
>         > 
>         
>         
>         It is difficult for me to debug this sort of thing, since I
>         telecommute from Tucson and the machines are located in
>         Phoenix.  But I can get someone there to power the machine
>         down and reboot.
>         
>          -Don Albert-
>         
>         
> 



