[openib-general] nightly osm_sim report 2006-12-14:normal completion

Eitan Zahavi eitan at mellanox.co.il
Thu Dec 14 12:24:26 PST 2006


Hal Rosenstock wrote:
> On Thu, 2006-12-14 at 14:53, Eitan Zahavi wrote:
>   
>> Update on analysis of failures:
>>
>> Eitan Zahavi wrote:
>>     
>>> Hal Rosenstock wrote:
>>>   
>>>       
>>>> Hi Eitan,
>>>>
>>>> On Thu, 2006-12-14 at 02:11, Eitan Zahavi wrote:
>>>>   
>>>>     
>>>>         
>>>>> OSM Simulation Regression Summary
>>>>> OpenSM rev = ____  
>>>>> ibutils rev = ____  
>>>>> Total=264 Pass=261 Fail=3
>>>>>
>>>>> Pass:
>>>>> 36 Stability IS1-16.topo
>>>>> 36 Pkey IS1-16.topo
>>>>> 36 Multicast IS1-16.topo
>>>>> 36 LidMgr IS1-16.topo
>>>>> 35 OsmStress IS1-16.topo
>>>>> 12 Stability IS3-loop.topo
>>>>> 12 Stability IS3-128.topo
>>>>> 12 Pkey IS3-128.topo
>>>>> 12 OsmStress IS3-128.topo
>>>>> 12 Multicast IS3-loop.topo
>>>>> 11 Multicast IS3-128.topo
>>>>> 11 LidMgr IS3-128.topo
>>>>>
>>>>> Failures:
>>>>> 1 OsmStress IS1-16.topo
>>>>>       
>>>>>           
>> The job was killed in the middle. Just an accident.
>>     
>
> Is that always the case? This one has been failing consistently.
> I think you wrote something about this failure back in July; I can
> dig it out if you want.
>
>   
>>>>> 1 Multicast IS3-128.topo
>>>>>       
>>>>>           
>> A single packet was dropped on the way to the SM; it is still not clear where.
>> However, I have seen a perfectly good link reported by the drop manager
>> as missing.
>>     
>
> I think I may have seen this as well on some rare occasions. I could
> never figure out why this happened.
>
>   
>> I will rerun some tests under valgrind, as I think this might be a memory
>> corruption issue.
>>     
>
> OK.
>
>   
>>>>> 1 LidMgr IS3-128.topo
>>>>>       
>>>>>           
>> It seems the last sweep started before the last LID change was made,
>> so it missed one of the nodes.
>> An additional sweep is now enforced at the end of the test, just to make
>> sure all changes are handled.
>>     
>
> So is this being improperly reported as a failure, then?
>   
Well, the test failed. The fix was committed. We will see over the next few
days whether it was really a false alarm.
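In rough pseudo-C, the idea of the fix is just to force one more sweep after 
the last LID change and wait for it to complete before the checker looks at 
the LIDs. This is only a sketch: apply_random_lid_changes(), force_sweep(), 
wait_subnet_up() and verify_lids() are made-up helper names for illustration, 
not the actual osm_sim/ibutils code.

    #include <stdio.h>
    #include <stdbool.h>

    /* Hypothetical stand-ins for the test harness; illustration only. */
    static void apply_random_lid_changes(void) { /* harness changes LIDs  */ }
    static void force_sweep(void)              { /* ask the SM to resweep */ }
    static bool wait_subnet_up(int timeout_s)  { (void)timeout_s; return true; }
    static bool verify_lids(void)              { return true; }

    static int run_lid_mgr_test(int iterations)
    {
        for (int i = 0; i < iterations; i++)
            apply_random_lid_changes();

        /* The last LID change can race with the SM's final sweep, so force
         * one more sweep and wait for it to finish before checking LIDs. */
        force_sweep();
        if (!wait_subnet_up(60)) {
            fprintf(stderr, "timeout waiting for the subnet to come up\n");
            return 1;
        }
        return verify_lids() ? 0 : 1;
    }

    int main(void)
    {
        return run_lid_mgr_test(100);
    }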
> -- Hal
>
>   
>>>>>     
>>>>>       
>>>>>           
>>>> There are now 2 more failures. You had previously explained the OsmStress
>>>> failure as needing more investigation. Now there are Multicast and
>>>> LidMgr failures, yet nothing really changed since the previous run the
>>>> night before. Are these new tests? What were the failures?
>>>>   
>>>>     
>>>>         
>>> The tests use random seeds and thus can catch different bugs on each run.
>>> I am investigating these failures. Some might be due to bugs in the
>>> checker code too.
>>>
>>> Please note that the failure rate is low (LidMgr passed 36+11 runs and
>>> failed 1 test).
>>> This implies the bug is a hard one to find.
>>>   
>>>       
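One thing that helps with these low-rate random failures is to log the seed 
at the start of every run so a failing run can be replayed exactly. A minimal 
sketch of that idea; the OSM_SIM_SEED variable is just an illustrative knob, 
not an existing option of the simulator.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Take the seed from the environment when replaying a failure,
     * otherwise from the clock, and always log it. */
    static unsigned int init_seed(void)
    {
        const char *env = getenv("OSM_SIM_SEED");    /* hypothetical knob */
        unsigned int seed = env ? (unsigned int)strtoul(env, NULL, 0)
                                : (unsigned int)time(NULL);
        srand(seed);
        printf("random seed = %u\n", seed);
        return seed;
    }

    int main(void)
    {
        init_seed();
        /* ... randomized test sequence would run here ... */
        return 0;
    }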
>>>> The repetitions have also been reduced from previous reports. Are these
>>>> the same or different tests?
>>>>   
>>>>     
>>>>         
>>> The number of repetitions depends on runtime. The regression started later
>>> and thus ran fewer iterations.
>>> I run the "same" tests ("same" meaning same code, not the same random sequence).
>>>   
>>>       
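In other words, the driver loop is time-boxed rather than counting to a fixed 
number, so a late start simply means fewer iterations. Roughly like the 
following; the two-hour budget and run_one_iteration() are only illustrative, 
not the real regression driver.

    #include <stdio.h>
    #include <time.h>

    /* Illustrative stand-in for one randomized test iteration (0 = pass). */
    static int run_one_iteration(void) { return 0; }

    int main(void)
    {
        const double budget_s = 2 * 60 * 60;   /* assumed wall-clock budget */
        time_t start = time(NULL);
        int runs = 0, fails = 0;

        /* Repeat the same test until the time budget runs out; starting
         * late therefore just means fewer iterations get done. */
        while (difftime(time(NULL), start) < budget_s) {
            if (run_one_iteration() != 0)
                fails++;
            runs++;
        }
        printf("Total=%d Pass=%d Fail=%d\n", runs, runs - fails, fails);
        return fails ? 1 : 0;
    }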
>>>> -- Hal
>>>>
>>>>




