[openib-general] nightly osm_sim report 2006-12-14:normal completion

Thu Dec 14 12:18:07 PST 2006

On Thu, 2006-12-14 at 14:53, Eitan Zahavi wrote:
> Update on analysis of failures:
> 
> Eitan Zahavi wrote:
> > Hal Rosenstock wrote:
> >   
> >> Hi Eitan,
> >>
> >> On Thu, 2006-12-14 at 02:11, Eitan Zahavi wrote:
> >>> OSM Simulation Regression Summary
> >>> OpenSM rev = ____  
> >>> ibutils rev = ____  
> >>> Total=264 Pass=261 Fail=3
> >>>
> >>> Pass:
> >>> 36 Stability IS1-16.topo
> >>> 36 Pkey IS1-16.topo
> >>> 36 Multicast IS1-16.topo
> >>> 36 LidMgr IS1-16.topo
> >>> 35 OsmStress IS1-16.topo
> >>> 12 Stability IS3-loop.topo
> >>> 12 Stability IS3-128.topo
> >>> 12 Pkey IS3-128.topo
> >>> 12 OsmStress IS3-128.topo
> >>> 12 Multicast IS3-loop.topo
> >>> 11 Multicast IS3-128.topo
> >>> 11 LidMgr IS3-128.topo
> >>>
> >>> Failures:
> >>> 1 OsmStress IS1-16.topo
> >>>       
> Job was killed in the middle. Just an accident.

Is that always the case ? This one has been consistently failing.
I think you had written something about this failure back in July. I can
dig it out if you want.

> >>> 1 Multicast IS3-128.topo
> >>>       
> A single packet was dropped on the way to the SM. Still not clear where.
> However, I have seen a perfectly good link reported by the drop manager 
> as missing.

I think I may have seen this as well on some rare occasions. I could
never figure out why this happened.

> I will rerun some tests with valgrind, as I think this might be a memory 
> corruption issue.

OK.

> >>> 1 LidMgr IS3-128.topo
> >>>       
> It seems the last sweep started before the last LID change was 
> made, so it missed one of the nodes.
> An additional sweep was enforced at the end of the test, just to make sure 
> all changes are handled.

So is this being reported as a failure improperly then ?

-- Hal

> >>>     
> >>>       
> >> There are now 2 more failures. You had previously explained OsmStress
> >> failure as needing more investigation. Now there is a Multicast and
> >> LidMgr failure yet nothing really changed since the previous run the
> >> night before. Are these new tests ? What were the failures ?
> >>   
> >>     
> > The tests use random seeds and thus can catch different bugs in each run.
> > I am investigating these failures. Some might be due to bugs in the 
> > checker code too.
> >
> > Please note that the failure rate is low (LidMgr passed 36+11 runs and 
> > failed 1).
> > This implies the bug is a hard one to find.
> >   
> >> The repetitions have also been reduced from previous reports. Are these
> >> the same or different tests ?
> >>   
> >>     
> > The number of repetitions depends on runtime. The regression started later 
> > and thus ran fewer iterations.
> > I run the "same" tests ("same" means the same code, not the same random sequence).
> >   
> >> -- Hal
> >>
> >>
> >> _______________________________________________
> >> openib-general mailing list
> >> openib-general at openib.org
> >> http://openib.org/mailman/listinfo/openib-general
> >>
> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> >>   
> >>     
>
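
For reference, the per-suite failure rates implied by the quoted summary (for example LidMgr: 36+11 passes, 1 failure) can be tallied with a short script. This is only a rough sketch, assuming each summary line has the form "<count> <suite> <topology>"; the names tally() and failure_rates() are made up for illustration and are not part of OpenSM or ibutils.

```python
from collections import defaultdict

# Pass and failure lines copied from the regression summary quoted above.
PASS_LINES = """\
36 Stability IS1-16.topo
36 Pkey IS1-16.topo
36 Multicast IS1-16.topo
36 LidMgr IS1-16.topo
35 OsmStress IS1-16.topo
12 Stability IS3-loop.topo
12 Stability IS3-128.topo
12 Pkey IS3-128.topo
12 OsmStress IS3-128.topo
12 Multicast IS3-loop.topo
11 Multicast IS3-128.topo
11 LidMgr IS3-128.topo
"""

FAIL_LINES = """\
1 OsmStress IS1-16.topo
1 Multicast IS3-128.topo
1 LidMgr IS3-128.topo
"""


def tally(text):
    """Sum run counts per suite, merging all topologies (hypothetical helper)."""
    counts = defaultdict(int)
    for line in text.strip().splitlines():
        count, suite, _topology = line.split()
        counts[suite] += int(count)
    return dict(counts)


def failure_rates(passed, failed):
    """Return failures / total runs for every suite seen in either tally."""
    rates = {}
    for suite in passed.keys() | failed.keys():
        p = passed.get(suite, 0)
        f = failed.get(suite, 0)
        rates[suite] = f / (p + f)
    return rates


if __name__ == "__main__":
    passed = tally(PASS_LINES)
    failed = tally(FAIL_LINES)
    print("Total=%d Pass=%d Fail=%d" % (
        sum(passed.values()) + sum(failed.values()),
        sum(passed.values()),
        sum(failed.values()),
    ))
    for suite, rate in sorted(failure_rates(passed, failed).items()):
        print("%-10s %.2f%%" % (suite, 100.0 * rate))
```

The totals it prints should agree with the Total/Pass/Fail line in the report. With only one failure per suite out of roughly 50 to 60 runs, a single night's rates say little on their own; they are more useful compared across several nightly reports.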