[openib-general] openSM failover / failback issue?
Sean Hefty
mshefty at ichips.intel.com
Wed Jul 12 15:36:23 PDT 2006
Hal Rosenstock wrote:
> With the default sminfo_polling_timeout of 10 seconds and the default
> polling_retry_number of 4, the total handoff time should be around 40
> seconds. I just did that experiment with 2 SMs and saw that as well.
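For reference, the 40-second figure quoted above is just the polling timeout multiplied by the retry count. A minimal sketch of that arithmetic follows. It assumes the standby SM gives up on the master only after every retry has timed out in turn. The function and parameter names are made up for illustration and are not OpenSM code.

```python
def expected_handoff_seconds(polling_timeout_s=10, polling_retries=4):
    """Rough upper bound on SM handoff time.

    Assumption (not taken from the OpenSM source): the standby SM declares
    the master dead only after `polling_retries` consecutive SMInfo polls
    have each timed out after `polling_timeout_s` seconds.
    """
    return polling_timeout_s * polling_retries


if __name__ == "__main__":
    # Defaults quoted in this thread: 10 second timeout, 4 retries.
    print(expected_handoff_seconds())
```

With the defaults this gives 40, so a standby SM that has not taken over well after that window points at something other than the polling settings.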
Okay - I narrowed down the test case to something reproducible.
I have 2 systems connected into Woody's cluster. I'm not sure how many systems
are in Woody's cluster, but probably around 8. OpenSM is running on one of the
systems in the cluster. If I run osmtest from either of my two systems, it
works fine.
If I start openSM on one of my systems, it becomes the master SM. The LIDs on
my systems are reassigned. If I run osmtest from either of my two systems, it
still works fine.
If I kill openSM on my system, then run osmtest -f c, I get a failure: Error on
query (IB_TIMEOUT). It looks like a CLASS_PORT_INFO query, but the query is
still being sent to my system, where openSM is no longer running.
At this point, if I unload / reload ib_mthca on either of my systems, Woody's SM
kicks in, reassigns my systems' LIDs, and osmtest starts working again.
I don't know if this is an HCA firmware issue, a switch issue, or an openSM issue.
I don't think it's related to my changes or osmtest at this point.
- Sean