[openfabrics-ewg] Re: [openib-general] OpenSM segmentation fault on RC5

Hal Rosenstock halr at voltaire.com
Tue May 30 08:17:10 PDT 2006


Don,

On Tue, 2006-05-30 at 10:55, Don.Albert at Bull.com wrote:
> Hal,
> 
> With your patch to OpenSM, I think everything is ok on the local node.

That patch, with one minor change (elimination of the CL_ASSERT), will be
part of the upcoming RC6.

>   The remote node is definitely having some problems, resulting in not
> responding to the MAD packets.  I have entered a separate message on
> the problems with the "ib0" interface on that machine.

> > 
> > On Fri, 2006-05-26 at 20:59, Hal Rosenstock wrote:
> > > > What next, coach?
> > > 
> > > Can you turn on madeye on the remote node and see what packets are
> > > received and sent? Let me know if you need help with that. I think
> > > you said you were running OFED, right?
> > 
> 
> Yes, I am running kernel 2.6.16 with the OFED RC5 release.  I will
> investigate how to run madeye, but the hangs on the remote machine are
> probably the root cause of the link failure.

Ah; got it. It's tied into the other problem. Yes, once the hangs are
resolved, the SMA on the remote node should respond, and I would expect
the port to reach the active state; you should be on your way then.
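
A quick way to confirm that the port has come up is to read its state
from sysfs. Below is a minimal sketch, not part of OFED or OpenSM. It
assumes the kernel exposes /sys/class/infiniband/<device>/ports/<port>/state
and that the file holds a line such as "4: ACTIVE"; the function names are
made up for illustration.

```python
import glob
import os


def parse_port_state(text):
    """Parse a sysfs port state string such as '4: ACTIVE' into (code, name)."""
    code, _, name = text.strip().partition(":")
    return int(code), name.strip()


def list_port_states(root="/sys/class/infiniband"):
    """Return {(device, port): (code, name)} for every port found under root.

    The directory layout under root is an assumption; adjust it if your
    kernel exposes the ports elsewhere.
    """
    states = {}
    for path in glob.glob(os.path.join(root, "*", "ports", "*", "state")):
        parts = path.split(os.sep)
        # Path ends in <device>/ports/<port>/state
        device, port = parts[-4], parts[-2]
        with open(path) as f:
            states[(device, port)] = parse_port_state(f.read())
    return states


if __name__ == "__main__":
    for (device, port), (code, name) in sorted(list_port_states().items()):
        print(f"{device} port {port}: {name} ({code})")
```

Run it on the remote node after the reboot. A port that stays in a state
other than ACTIVE once the SM has swept would suggest the SMA is still not
answering.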

> > I don't think madeye is part of OFED :-( Can it get added for RC6,
> > Tziporet? I think it would be a useful tool to add for problems
> > like this.
> > 
> > Also, was this a working setup before? Did anything else change
> > besides installing RC5 on both nodes?
> > 
> 
> This back-to-back setup was working originally with a backported
> 2.6.11-34 kernel and, I believe, revision 6500 from the OpenIB
> svn trunk at that time.  The problems started when I tried to move to
> RC4 and now RC5 of the OFED release, with the 2.6.16 kernel.
> 
> > I have two more experiments I'd like you to try, before we go down
> > the madeye "route":
> > 
> > 1. Do you have another IB cable to try?
> > 
> > 2. Can you completely shut down and repower the remote node and see
> > if it starts responding?
> > 
> 
> It is difficult for me to debug this sort of thing, since I
> telecommute from Tucson and the machines are located in Phoenix.  But
> I can get someone there to power the machine down and reboot.

It's OK; you explained the state of the remote node so neither of those
experiments is necessary.

-- Hal

>  -Don Albert-
> 



