[openib-general] openSM failover / failback issue?

Hal Rosenstock halr at voltaire.com
Wed Jul 12 18:45:47 PDT 2006


On Wed, 2006-07-12 at 18:36, Sean Hefty wrote:
> Hal Rosenstock wrote:
> > With the default sminfo_polling_timeout of 10 seconds and default
> > polling_retry_number of 4, the total handoff time should be around 40
> > seconds. I just did that experiment with 2 SMs and saw that as well.
> 
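As a rough sanity check on the 40 second figure quoted above, the arithmetic can be sketched in Python. This is a simplified model, not OpenSM's actual logic: it assumes the standby SM takes over only after polling_retry_number consecutive polls of the master have each waited out the full sminfo_polling_timeout. The function name is hypothetical.

```python
def expected_handoff_seconds(sminfo_polling_timeout_s=10.0, polling_retry_number=4):
    """Estimate the handoff time under the simplified model described above.

    Assumption: every one of the polling_retry_number polls must run the
    full sminfo_polling_timeout_s before the standby SM takes over.
    """
    return sminfo_polling_timeout_s * polling_retry_number


if __name__ == "__main__":
    # Defaults mentioned in the thread: 10 second timeout, 4 retries.
    print(expected_handoff_seconds())
```

With the defaults this gives about 40 seconds, so a standby SM that has still not taken over well after that point suggests something other than the polling interval is involved.
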
> Okay - I narrowed down the test case to something reproducible.
> 
> I have 2 systems connected into Woody's cluster.  I'm not sure how many systems 
> are in Woody's cluster, but probably around 8.  OpenSM is running on one of the 
> systems in the cluster.  If I run osmtest from either of my two systems, it 
> works fine.
> 
> If I start openSM on one of my systems, it becomes the master SM.  The LIDs on 
> my systems are reassigned.  If I run osmtest from either of my two systems, it 
> still works fine.
> 
> If I kill openSM on my system, then run osmtest -f c, I get a failure: Error on 
> query (IB_TIMEOUT).  It looks like a CLASS_PORT_INFO query,

Yes, that's the first SA query that osmtest makes.

> but the query is going to my now dead opensm system.

What does ibstat/ibstatus say for the SMLID on the osmtest machine?
What about the OpenSM machine?
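
If ibstat/ibstatus is not handy, the same SM LID can usually be read straight from sysfs. A minimal Python sketch, assuming the usual /sys/class/infiniband/<hca>/ports/<port>/sm_lid layout with a hex value such as 0x1 in each file (the function name is hypothetical, and the layout should be checked on the machine in question):

```python
from pathlib import Path


def read_sm_lids(sysfs_root="/sys/class/infiniband"):
    """Return {(hca, port): sm_lid} for every port found under sysfs_root.

    Assumption: each port directory holds an sm_lid file containing a
    hex string such as "0x1".
    """
    sm_lids = {}
    for sm_lid_file in sorted(Path(sysfs_root).glob("*/ports/*/sm_lid")):
        port = sm_lid_file.parent.name               # e.g. "1"
        hca = sm_lid_file.parent.parent.parent.name  # e.g. "mthca0"
        sm_lids[(hca, port)] = int(sm_lid_file.read_text().strip(), 16)
    return sm_lids


if __name__ == "__main__":
    for (hca, port), sm_lid in read_sm_lids().items():
        print(f"{hca} port {port}: SM LID {sm_lid:#x}")
```

Running this on both the osmtest machine and the OpenSM machine and comparing the values would show whether the ports still point at the LID of the SM that was killed.
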

> At this point, if I unload / reload ib_mthca on either of my systems, Woody's SM 
> kicks in, reassigns my systems' LIDs, and osmtest starts working again.

to Woody's SM.

> I don't know if this is an HCA firmware issue, switch issue, or openSM issue. 
> I don't think it's related to my changes or osmtest at this point.

I'll see if I can reproduce this tomorrow.

Also, can you send me the guid2lid files from the 3 SMs?

-- Hal

> - Sean




