[ofa-general] RE: Re: crash in ipoib

Woodruff, Robert J robert.j.woodruff at intel.com
Wed Jun 13 12:29:17 PDT 2007


We are running a RHEL4 2.6.9-42EL kernel on a Rocks install.

The tests I run are IMB with Intel MPI over uDAPL and, at the same
time, IMB over IPoIB. It usually takes at least 1 day, sometimes 2
days, of running IMB in a loop with various numbers of processes per
node (1, 2, and 4). It seems to fail randomly, not on the same
node every time, so it is not feasible to connect a serial console
to every node. It would also be hard for us to put in a new kernel,
as that causes problems with Rocks. The systems are the older Xeon,
Lindenhurst, 3.6 GHz.

I have not seen this error on any other kernel or system. I have tested
RHEL5 and RHEL4-U5, though only on 2 nodes, and those do not seem
to fail. We also have OFED 1.2 running on 64- and 256-node production
application development clusters, and they have not reported any
similar problems, but they are not running the same tests.

I plan on loading OFED 1.2-rc5 today. Is there an easy way to build the 
IPoIB driver from the OFED installer so that it has debug enabled?

 woody

-----Original Message-----
From: Michael S. Tsirkin [mailto:mst at dev.mellanox.co.il] 
Sent: Wednesday, June 13, 2007 11:10 AM
To: Hefty, Sean
Cc: 'Michael S. Tsirkin'; Sean Hefty; Woodruff, Robert J; 'Vladimir
Sokolovsky'; general at lists.openfabrics.org
Subject: Re: Re: crash in ipoib

> Quoting Sean Hefty <sean.hefty at intel.com>:
> Subject: RE: Re: crash in ipoib
> 
> >This looks strange. Can you supply some more data please?
> >Which HCA are you running on?
> >What test are you running?
> >What should I do to reproduce this?
> >Further, could you supply the full oops?
> 
> Woody will need to answer the test/config questions.  The oops is only
> displayed on the screen, and the stack trace is about 50-75 calls long.
> The start of the oops gets pushed off the screen.  (Can we be
> overrunning the stack?)  I'm not at the systems today, but can probably
> get what else is available tomorrow.

Getting a serial console would be the thing to do then.
If you are worried about stack overflow, build your kernel
with stack instrumentation.
It's quite likely the real oops reason has scrolled off the screen;
what you post here could be the result of memory corruption that followed.

> We have, I think, up to 16 systems running the tests, and we only see
> failures on specific nodes (which all happen to be the same type of
> system).

One thing to try to check is whether it's kernel-specific.
What happens if you install a different kernel/OS there?
Try RHEL5 or just build a 2.6.20 kernel there.
Does it still happen?

-- 
MST


