[openib-general] question on opensm error

Hal Rosenstock halr at voltaire.com
Wed Feb 16 08:47:23 PST 2005


On Wed, 2005-02-16 at 11:45, Ronald G. Minnich wrote:
> On Tue, 16 Feb 2005, Hal Rosenstock wrote:
> 
> > On Tue, 2005-02-15 at 22:22, Ronald G. Minnich wrote:
> > > On Tue, 15 Feb 2005, Hal Rosenstock wrote:
> > > 
> > > > I presume your subnet has 179 HCAs ? Do you know ?
> > > 
> > > no errors. It's just that opensm won't run. 
> > 
> > Won't run or won't do anything on the subnet ?
> > 
> > Not sure what you mean by won't run ?
> 
> ok, just found it. 
> 
> There is a sys fail red light on the CPU on the 96-port switch that the
> opensm host attaches to.
> 
> What's weird is none of the ib admin tools found anything. ibnetdiscover 
> happily walked the whole subnet. The only problem was that opensm would 
> not run, but the errors were unclear. So many things appeared to be 
> working that it did not occur to me to walk over and look at the switch. 
> Stupid of me. 

I'm still not 100% clear on the failure mode. I don't know what the sys
fail light on the CPU means. It may mean that things partially work: the
CPU might have crashed while the IB chips continue to function based on
their current setup. That would depend on how functionality is split
between the CPU and the IB chip firmware (which may vary by vendor).

If you were able to walk the subnet with the (SMP based) diags, the SM
port had to be at least in the INIT state (see ibstat/ibstatus).
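
As a rough illustration of that check, here is a minimal Python sketch.
It assumes the port state is available as a string of the form
"<number>: <NAME>" (for example "2: INIT"), which is how the Linux sysfs
state file reports it; the function names are made up for this example
and are not part of any IB tool.

```python
import re

# Logical port states by number (assumed mapping for this sketch).
PORT_STATES = {1: "DOWN", 2: "INIT", 3: "ARMED", 4: "ACTIVE"}


def parse_port_state(text):
    """Parse a state string such as '2: INIT' into (number, name)."""
    m = re.match(r"\s*(\d+):\s*(\w+)", text)
    if not m:
        raise ValueError("unrecognized port state: %r" % text)
    return int(m.group(1)), m.group(2).upper()


def at_least_init(text):
    """Return True if the port is in INIT or a higher state."""
    number, _name = parse_port_state(text)
    return number >= 2


if __name__ == "__main__":
    for sample in ("1: DOWN", "2: INIT", "4: ACTIVE"):
        print(sample, "->", at_least_init(sample))
```

A port reported as "1: DOWN" fails the check, which matches the SM PORT
DOWN case quoted below.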

The key questions are what the failure mode was, so we can see how to
detect it better in the future, and what caused the switch CPU to crash
in the first place.

-- Hal

> Now that I've turned that switch off I get this:
> [1108572233:000155763][40BFF970] -> __osm_state_mgr_sm_port_down_msg: 
> 
> 
> ******************************************************************
> ************************** SM PORT DOWN **************************
> ******************************************************************
> 
> 
> [1108572233:000155778][40BFF970] -> __osm_sm_state_mgr_signal_error: ERR 
> 3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state 
> IB_SMINFO_STATE_DISCOVERING.
> 
> which I assume is its way of telling me that the switch port is down. 
> 
> ron
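
For spotting this kind of failure in a log, a small Python sketch along
these lines could pull the function name and error number out of each
entry. The line layout is inferred only from the two entries quoted
above (with the mail wrapping undone), so the regular expressions and
field names are assumptions, not a description of the opensm log format
in general.

```python
import re

# Layout assumed from the quoted entries:
#   [<seconds>:<subseconds>][<thread id>] -> <function>: <message>
LINE_RE = re.compile(r"\[(\d+):(\d+)\]\[([0-9A-Fa-f]+)\] -> (\w+):\s*(.*)")
ERR_RE = re.compile(r"ERR\s+([0-9A-Fa-f]+):\s*(.*)")


def parse_line(line):
    """Split one log entry into its fields, or return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    seconds, subseconds, thread, function, message = m.groups()
    err = ERR_RE.match(message)
    return {
        "seconds": int(seconds),
        "subseconds": subseconds,   # kept as text; its unit is not assumed here
        "thread": thread,
        "function": function,
        "err_code": err.group(1) if err else None,
        "message": err.group(2) if err else message,
    }


if __name__ == "__main__":
    sample = ("[1108572233:000155778][40BFF970] -> "
              "__osm_sm_state_mgr_signal_error: ERR 3207: Invalid signal "
              "OSM_SM_SIGNAL_DISCOVER in state IB_SMINFO_STATE_DISCOVERING.")
    print(parse_line(sample))
```

Filtering on entries whose function is __osm_state_mgr_sm_port_down_msg,
or whose err_code is set, would flag the situation Ron hit.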



