[openib-general] openSM failover / failback issue?
Hal Rosenstock
halr at voltaire.com
Wed Jul 12 18:45:47 PDT 2006
On Wed, 2006-07-12 at 18:36, Sean Hefty wrote:
> Hal Rosenstock wrote:
> > With the default sminfo_polling_timeout of 10 seconds and default
> > polling_retry_number of 4, the total handoff time should be around 40
> > seconds. I just did that experiment with 2 SMs and saw that as well.
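
For reference, the 40-second figure above is just the polling timeout multiplied by the retry count. A minimal sketch of that arithmetic, assuming the two defaults quoted above and ignoring the time the standby SM then needs for its own sweep (the function name is made up for illustration and is not part of OpenSM):

```python
# Rough estimate only: the standby SM polls the master's SMInfo every
# sminfo_polling_timeout seconds and gives up after polling_retry_number
# consecutive failures. The values below are the defaults quoted in this thread.
SMINFO_POLLING_TIMEOUT_S = 10
POLLING_RETRY_NUMBER = 4


def estimated_handoff_s(polling_timeout_s: int, retry_number: int) -> int:
    """Hypothetical helper: seconds before a standby SM declares the master dead."""
    return polling_timeout_s * retry_number


if __name__ == "__main__":
    print(estimated_handoff_s(SMINFO_POLLING_TIMEOUT_S, POLLING_RETRY_NUMBER))
```

Lowering either value should shorten the handoff proportionally, at the cost of more SMInfo traffic or a higher chance of a false takeover.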
>
> Okay - I narrowed down the test case to something reproducible.
>
> I have 2 systems connected into Woody's cluster. I'm not sure how many systems
> are in Woody's cluster, but probably around 8. OpenSM is running on one of the
> systems in the cluster. If I run osmtest from either of my two systems, it
> works fine.
>
> If I start openSM on one of my systems, it becomes the master SM. The LIDs on
> my systems are reassigned. If I run osmtest from either of my two systems, it
> still works fine.
>
> If I kill openSM on my system, then run osmtest -f c, I get a failure: Error on
> query (IB_TIMEOUT). It looks like a CLASS_PORT_INFO query,
Yes, that's the first SA query that osmtest makes.
> but the query is going to my now dead opensm system.
What does ibstat/ibstatus say for the SMLID on the osmtest machine ?
What about the OpenSM machine ?
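
If it helps to compare the machines quickly, something like the following can pull the LIDs out of saved ibstat output. This is only a sketch: it assumes ibstat prints "Base lid:" and "SM lid:" lines for each port, in that order, and the sample text is invented for illustration, not captured from any of these systems.

```python
import re

# Matches the per-port "Base lid: N" and "SM lid: N" lines (assumed format).
LID_LINE = re.compile(r"\s*(Base lid|SM lid):\s*(\d+)")


def port_lids(ibstat_text):
    """Return a list of (base_lid, sm_lid) pairs, one per port found."""
    pairs = []
    base = None
    for line in ibstat_text.splitlines():
        m = LID_LINE.match(line)
        if not m:
            continue
        key, value = m.group(1), int(m.group(2))
        if key == "Base lid":
            base = value
        else:
            # "SM lid" closes out the current port's pair.
            pairs.append((base, value))
            base = None
    return pairs


# Invented sample, for illustration only.
SAMPLE = """\
CA 'mthca0'
    Port 1:
        State: Active
        Base lid: 5
        SM lid: 1
"""

if __name__ == "__main__":
    for base_lid, sm_lid in port_lids(SAMPLE):
        print(f"port LID {base_lid} points at SM LID {sm_lid}")
```

If the SM lid reported on the osmtest machine is still the LID of the machine where OpenSM was killed, that would explain the query timing out.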
> At this point, if I unload / reload ib_mthca on either of my systems, Woody's SM
> kicks in, reassigns my systems' LIDs, and osmtest starts working again.
to Woody's SM.
> I don't know if this is an HCA firmware issue, switch issue, or openSM issue.
> I don't think it's related to my changes or osmtest at this point.
I'll see if I can reproduce this tomorrow.
Also, can you send me the guid2lid files from the 3 SMs ?
-- Hal
> - Sean