[openib-general] openSM failover / failback issue?

Sean Hefty mshefty at ichips.intel.com
Wed Jul 12 15:36:23 PDT 2006


Hal Rosenstock wrote:
> With the default sminfo_polling_timeout of 10 seconds and default
> polling_retry_number of 4, the total handoff time should be around 40
> seconds. I just did that experiment with 2 SMs and saw that as well.
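
For reference, the 40-second figure Hal quotes is simply the product of the
two defaults. A minimal sketch of that arithmetic, assuming the values given
above are in seconds (the function name is made up for illustration and is not
part of OpenSM):

```python
def estimated_handoff_seconds(polling_timeout_s=10, retry_number=4):
    """Rough time before a standby SM gives up on a silent master.

    Assumption: the standby polls once per timeout interval and takes over
    after retry_number consecutive polls go unanswered.
    """
    return polling_timeout_s * retry_number


print(estimated_handoff_seconds())  # 10 s * 4 retries
```

Under that assumption, a failed master that is still not replaced well past
this interval points at something other than the polling defaults.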

Okay - I narrowed down the test case to something reproducible.

I have 2 systems connected into Woody's cluster.  I'm not sure how many systems 
are in Woody's cluster, but probably around 8.  OpenSM is running on one of the 
systems in the cluster.  If I run osmtest from either of my two systems, it 
works fine.

If I start OpenSM on one of my systems, it becomes the master SM.  The LIDs on
my systems are reassigned.  If I run osmtest from either of my two systems, it 
still works fine.

If I kill OpenSM on my system and then run osmtest -f c, I get a failure: Error
on query (IB_TIMEOUT).  It looks like a CLASS_PORT_INFO query, but the query is
still being sent to my system, where OpenSM is no longer running.

At this point, if I unload / reload ib_mthca on either of my systems, Woody's SM 
kicks in, reassigns my systems' LIDs, and osmtest starts working again.

I don't know if this is an HCA firmware issue, a switch issue, or an OpenSM
issue.  I don't think it's related to my changes or to osmtest at this point.

- Sean



