[ofa-general] Re: Re: crash in ipoib

Michael S. Tsirkin mst at dev.mellanox.co.il
Wed Jun 13 11:09:49 PDT 2007


> Quoting Sean Hefty <sean.hefty at intel.com>:
> Subject: RE: Re: crash in ipoib
> 
> >This looks strange. Can you supply some more data please?
> >Which HCA are you running on?
> >What test are you running?
> >What should I do to reproduce this?
> >Further, could you supply the full oops?
> 
> Woody will need to answer the test/config questions.  The oops is only displayed
> on the screen, and the stack trace is about 50-75 calls long.  The start of the
> oops gets pushed off the screen.  (Can we be overrunning the stack?)  I'm not at
> the systems today, but can probably get what else is available tomorrow.

Getting a serial console would be the thing to do then.
If you are worried about stack overflow, build your kernel
with stack instrumentation.
It's quite likely the real oops reason has scrolled off the screen;
what you post here could be the result of memory corruption that followed it.
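
Before rebuilding, it may help to check whether the running kernel's
.config already has stack instrumentation turned on. Below is a minimal
sketch, not a definitive tool. The helper name is hypothetical, and the
two option names (CONFIG_DEBUG_STACKOVERFLOW, CONFIG_DEBUG_STACK_USAGE)
are my assumption about what "stack instrumentation" covers on an x86
2.6 kernel; adjust the list for your architecture.

```python
# Hypothetical helper: report which stack-debugging options a kernel
# .config leaves disabled. The option list is an assumption, not
# something taken from this thread.
STACK_DEBUG_OPTIONS = (
    "CONFIG_DEBUG_STACKOVERFLOW",  # assumed: warn when the stack runs low
    "CONFIG_DEBUG_STACK_USAGE",    # assumed: stack usage instrumentation
)


def missing_stack_debug_options(config_text, wanted=STACK_DEBUG_OPTIONS):
    """Return the wanted options that are not set to 'y' in config_text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set".
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value == "y":
            enabled.add(name)
    return [opt for opt in wanted if opt not in enabled]
```

Pass it the contents of the .config from the kernel build tree; anything
it returns would need to be enabled before the rebuild.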

> We have, I think, up to 16 systems running the tests, and we only see failures
> on specific nodes (which all happen to be the same type of system).

One thing to try to check is whether it's kernel-specific.
What happens if you install a different kernel/OS there?
Try RHEL5 or just build a 2.6.20 kernel there.
Does it still happen?

-- 
MST


