[openib-general] nightly osm_sim report 2006-12-14:normal completion
Eitan Zahavi
eitan at mellanox.co.il
Thu Dec 14 12:24:26 PST 2006
Hal Rosenstock wrote:
> On Thu, 2006-12-14 at 14:53, Eitan Zahavi wrote:
>
>> Update on analysis of failures:
>>
>> Eitan Zahavi wrote:
>>
>>> Hal Rosenstock wrote:
>>>
>>>
>>>> Hi Eitan,
>>>>
>>>> On Thu, 2006-12-14 at 02:11, Eitan Zahavi wrote:
>>>>
>>>>
>>>>
>>>>> OSM Simulation Regression Summary
>>>>> OpenSM rev = ____
>>>>> ibutils rev = ____
>>>>> Total=264 Pass=261 Fail=3
>>>>>
>>>>> Pass:
>>>>> 36 Stability IS1-16.topo
>>>>> 36 Pkey IS1-16.topo
>>>>> 36 Multicast IS1-16.topo
>>>>> 36 LidMgr IS1-16.topo
>>>>> 35 OsmStress IS1-16.topo
>>>>> 12 Stability IS3-loop.topo
>>>>> 12 Stability IS3-128.topo
>>>>> 12 Pkey IS3-128.topo
>>>>> 12 OsmStress IS3-128.topo
>>>>> 12 Multicast IS3-loop.topo
>>>>> 11 Multicast IS3-128.topo
>>>>> 11 LidMgr IS3-128.topo
>>>>>
>>>>> Failures:
>>>>> 1 OsmStress IS1-16.topo
>>>>>
>>>>>
>> The job was killed in the middle of the run. Just an accident.
>>
>
> Is that always the case ? This one has been consistently failing.
> I think you had written something about this failure back in July. I can
> dig it out if you want.
>
>
>>>>> 1 Multicast IS3-128.topo
>>>>>
>>>>>
>> A single packet was dropped on the way to the SM. It is still not clear
>> where. However, I have seen a perfectly good link reported by the drop
>> manager as missing.
>>
>
> I think I may have seen this as well on some rare occasions. I could
> never figure out why this happened.
>
>
>> I will rerun some tests with valgrind as I think this might be a memory
>> corruption issue.
>>
>
> OK.
>
>
>>>>> 1 LidMgr IS3-128.topo
>>>>>
>>>>>
>> It seems the last sweep started before the last LID change was made,
>> so it missed one of the nodes.
>> An additional sweep is now forced at the end of the test, just to make
>> sure all changes are handled.
>>
>
> So is this being reported as a failure improperly then ?
>
Well, the test failed. The fix was committed. We will see in the next few
days whether it was really a false alarm.
> -- Hal
>
>
>>>>>
>>>>>
>>>>>
>>>> There are now 2 more failures. You had previously explained OsmStress
>>>> failure as needing more investigation. Now there is a Multicast and
>>>> LidMgr failure yet nothing really changed since the previous run the
>>>> night before. Are these new tests ? What were the failures ?
>>>>
>>>>
>>>>
>>> The tests use random seeds and thus can catch different bugs in each run.
>>> I am investigating these failures. Some might also be due to bugs in the
>>> checker code.
>>>
>>> Please note that the failure rate is low (LidMgr passed 36+11 runs and
>>> failed 1).
>>> This implies the bug is a hard one to find.
>>>
>>>
>>>> The repetitions have also been reduced from previous reports. Are these
>>>> the same or different tests ?
>>>>
>>>>
>>>>
>>> The number of repetitions depends on runtime. The regression started
>>> later and thus ran fewer iterations.
>>> I run the "same" tests ("same" means the same code, not the same random
>>> sequence).
>>>
>>>
>>>> -- Hal
>>>>
>>>>
>>>> _______________________________________________
>>>> openib-general mailing list
>>>> openib-general at openib.org
>>>> http://openib.org/mailman/listinfo/openib-general
>>>>
>>>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>>>>
>>>>
>>>>
>
>
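For readers who want to check the "failure rate is low" remark against the summary quoted above, here is a minimal sketch that tallies pass/fail counts per test and prints the failure rate. The counts are copied from the report; the names SUMMARY and failure_rates are hypothetical and are not part of osm_sim or ibutils.

```python
from collections import defaultdict

# (test, topology, passed, failed), copied from the regression summary
# quoted in this thread. The layout of this table is an assumption made
# for illustration; it is not the format osm_sim itself produces.
SUMMARY = [
    ("Stability", "IS1-16.topo", 36, 0),
    ("Pkey", "IS1-16.topo", 36, 0),
    ("Multicast", "IS1-16.topo", 36, 0),
    ("LidMgr", "IS1-16.topo", 36, 0),
    ("OsmStress", "IS1-16.topo", 35, 1),
    ("Stability", "IS3-loop.topo", 12, 0),
    ("Stability", "IS3-128.topo", 12, 0),
    ("Pkey", "IS3-128.topo", 12, 0),
    ("OsmStress", "IS3-128.topo", 12, 0),
    ("Multicast", "IS3-loop.topo", 12, 0),
    ("Multicast", "IS3-128.topo", 11, 1),
    ("LidMgr", "IS3-128.topo", 11, 1),
]


def failure_rates(rows):
    """Return {test name: failed / (passed + failed)} summed over topologies."""
    totals = defaultdict(lambda: [0, 0])  # test -> [passed, failed]
    for test, _topo, passed, failed in rows:
        totals[test][0] += passed
        totals[test][1] += failed
    return {test: f / (p + f) for test, (p, f) in totals.items()}


if __name__ == "__main__":
    for test, rate in sorted(failure_rates(SUMMARY).items()):
        print(f"{test:10s} {rate:.1%}")
```

With these counts LidMgr comes out at 1 failure in 48 runs, roughly 2.1%, which is why a given seed-dependent bug may not show up on consecutive nights.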