[openib-general] question on opensm error

shaharf shaharf at voltaire.com
Thu Feb 17 05:04:39 PST 2005


Hi,

> > There is a sys fail red light on the CPU on the 96-port switch that
> > the opensm host attaches to.
> >
> > What's weird is none of the ib admin tools found anything.
> > ibnetdiscover happily walked the whole subnet. The only problem was
> > that opensm would not run, but the errors were unclear. So many
> > things appeared to be working that it did not occur to me to walk
> > over and look at the switch. Stupid of me.
> 
> Still not 100% clear on the failure mode. I don't know what the sys
> fail light on the CPU means. It may mean that things partially work.
> By that, I mean the CPU might crash but the IB chips continue to
> function based on their current setup. It would depend on the split of
> functionality between the CPU and the IB chip firmware (which may
> depend on vendor).
> 
> If you were able to walk the subnet with the (SMP based) diags, the SM
> port had to be at least in init (ibstat/ibstatus).
> 
> The "keys" are what was the failure mode so we can see how this can be
> detected better in the future, and what caused the switch CPU to crash
> in the first place.
> 
> -- Hal
> 

I totally agree with Hal. The switch's CPU error is not the bug that
concerns us here. We should handle it just like any other device
failure, and we should be able to either overcome such a failure or at
least diagnose the error.
If you are able to reproduce the situation, please do so while the SM is
running with the -V flag (full verbosity) and send the osm log file
(/tmp/osm.log) to the list. This will help us understand what the
opensm problem is. The output of ibnetdiscover may help too.
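
For convenience, the two pieces of information requested above could be
bundled into one archive with something like the following. This is only
a rough sketch: the function name collect_diagnostics, the output
directory name and the archive layout are made up for illustration, and
it assumes opensm was already run with -V so that /tmp/osm.log exists
and that ibnetdiscover is on the PATH.

```python
import shutil
import subprocess
import tarfile
from pathlib import Path


def collect_diagnostics(log_path="/tmp/osm.log",
                        discover_cmd=("ibnetdiscover",),
                        out_dir="osm-report"):
    """Copy the opensm log, save the discovery output, and tar both.

    All names here are illustrative; only /tmp/osm.log and the
    ibnetdiscover command come from the message above.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Copy the verbose log written while opensm ran with -V.
    shutil.copy(log_path, out / "osm.log")

    # Capture the topology dump. stderr is kept as well, because a
    # failing discovery run is useful information in itself.
    result = subprocess.run(list(discover_cmd),
                            capture_output=True, text=True)
    (out / "ibnetdiscover.out").write_text(result.stdout + result.stderr)

    # Bundle everything into a single archive to attach to the mail.
    archive = Path(f"{out_dir}.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(out, arcname=out.name)
    return archive


if __name__ == "__main__":
    print(collect_diagnostics())
```

Passing the discovery command as a parameter makes it possible to try
the script on a machine without the IB tools installed.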

Thanks,
	Shahar
	


