[openib-general] nightly osm_sim report 2006-12-14: normal completion

Thu Dec 14 11:53:52 PST 2006

Update on analysis of failures:

Eitan Zahavi wrote:
> Hal Rosenstock wrote:
>   
>> Hi Eitan,
>>
>> On Thu, 2006-12-14 at 02:11, Eitan Zahavi wrote:
>>   
>>     
>>> OSM Simulation Regression Summary
>>> OpenSM rev = ____  
>>> ibutils rev = ____  
>>> Total=264 Pass=261 Fail=3
>>>
>>> Pass:
>>> 36 Stability IS1-16.topo
>>> 36 Pkey IS1-16.topo
>>> 36 Multicast IS1-16.topo
>>> 36 LidMgr IS1-16.topo
>>> 35 OsmStress IS1-16.topo
>>> 12 Stability IS3-loop.topo
>>> 12 Stability IS3-128.topo
>>> 12 Pkey IS3-128.topo
>>> 12 OsmStress IS3-128.topo
>>> 12 Multicast IS3-loop.topo
>>> 11 Multicast IS3-128.topo
>>> 11 LidMgr IS3-128.topo
>>>
>>> Failures:
>>> 1 OsmStress IS1-16.topo
>>>       
The job was killed partway through the run. This was just an accident.
>>> 1 Multicast IS3-128.topo
>>>       
A single packet was dropped on its way to the SM; it is still not clear
where. However, I have seen the drop manager report a perfectly good link
as missing.
I will rerun some tests under valgrind, as I suspect this might be a
memory corruption issue.
>>> 1 LidMgr IS3-128.topo
>>>       
It seems the last sweep started before the last LID change was made,
so it missed one of the nodes.
An additional sweep was forced at the end of the test, just to make sure
all changes are handled.
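To make the timing problem concrete, here is a minimal Python sketch. It
assumes a simplified model in which a sweep sees every LID change made
before the sweep starts and nothing made after it; the function names are
hypothetical and are not OpenSM or osm_sim code. It shows why a change made
after the last sweep goes unnoticed, and why one extra sweep forced after
the final change closes that gap.

```python
from typing import List


def missed_changes(change_times: List[float], sweep_times: List[float]) -> List[float]:
    """Return the change times no sweep observed (hypothetical model).

    A sweep started at time s sees every change made at t <= s, so only
    changes made after the last sweep are missed.
    """
    if not sweep_times:
        return list(change_times)
    last_sweep = max(sweep_times)
    return [t for t in change_times if t > last_sweep]


def with_final_sweep(change_times: List[float], sweep_times: List[float]) -> List[float]:
    """Append one forced sweep scheduled after every change and sweep so far."""
    final = max(change_times + sweep_times, default=0.0) + 1.0
    return sweep_times + [final]


changes = [1.0, 2.5, 4.0]  # last LID change happens at t=4.0
sweeps = [0.0, 3.0]        # last sweep starts at t=3.0, before that change
print(missed_changes(changes, sweeps))                             # [4.0]
print(missed_changes(changes, with_final_sweep(changes, sweeps)))  # []
```

The forced sweep does not remove the race during the test; it only
guarantees that the final state is checked after all changes are in.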
>>>     
>>>       
>> There are now 2 more failures. You had previously explained the OsmStress
>> failure as needing more investigation. Now there is a Multicast and a
>> LidMgr failure, yet nothing really changed since the previous run the
>> night before. Are these new tests? What were the failures?
>>   
>>     
> The tests use random seeds and thus can catch other bugs in each run.
> I am investigating these failures. Some might be due to bugs in the 
> checker code too.
>
> Please note that the failure rate is low (LidMgr passed 36+11 runs and
> failed only 1).
> This implies the bug is a hard one to find.
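For reference, here is the rate behind that statement, worked out from the
LidMgr counts in the summary (36 passes on IS1-16.topo, 11 passes and 1
failure on IS3-128.topo). The second part assumes, purely for illustration,
that every run fails independently with the same probability; the helper
name is hypothetical.

```python
from fractions import Fraction

# LidMgr counts taken from the summary: 36 passes (IS1-16.topo) plus
# 11 passes and 1 failure (IS3-128.topo).
runs = 36 + 11 + 1
failures = 1
rate = Fraction(failures, runs)
print(rate, f"{float(rate):.1%}")  # 1/48 2.1%


def p_at_least_one(p: Fraction, n: int) -> float:
    """Chance of >= 1 failure in n runs, ASSUMING independent runs with rate p."""
    return 1.0 - (1.0 - float(p)) ** n


for n in (12, 36):  # iteration counts that appear in the summary
    print(n, f"{p_at_least_one(rate, n):.0%}")
```

Under that independence assumption, a batch of 12 runs would show at least
one failure only about 22% of the time, which is consistent with a bug that
appears on some nights and not on others.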
>   
>> The repetitions have also been reduced from previous reports. Are these
>> the same or different tests ?
>>   
>>     
> The number of repetitions depends on runtime. The regression started later,
> so it ran fewer iterations.
> I run the "same" tests ("same" means the same code, not the same random sequence).
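A minimal Python sketch of what "same code, not same random sequence"
means, assuming a test driver that takes an integer seed (the driver and its
name are hypothetical, not part of osm_sim): reusing a seed replays the same
sequence, while a new seed exercises different paths through the same code.

```python
import random
from typing import List


def run_scenario(seed: int, steps: int = 5) -> List[int]:
    """Hypothetical driver: draw `steps` random LIDs to perturb from a seeded generator."""
    rng = random.Random(seed)  # private generator, unaffected by global random state
    return [rng.randrange(1, 128) for _ in range(steps)]


# Same code, same seed: identical sequence, so a failing run can be replayed.
assert run_scenario(1234) == run_scenario(1234)

# Same code, different seed: in general a different sequence of perturbations.
print(run_scenario(1234))
print(run_scenario(5678))
```

Using a private `random.Random` instance rather than the module-level
functions keeps each scenario's sequence independent of any other code that
draws random numbers.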
>   
>> -- Hal
>>
>>
>> _______________________________________________
>> openib-general mailing list
>> openib-general at openib.org
>> http://openib.org/mailman/listinfo/openib-general
>>
>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>>   
>>     