[openib-general] OpenSM not coming out of standby state..

Hal Rosenstock halr at voltaire.com
Wed Nov 30 22:04:49 PST 2005


Hi Troy,

On Wed, 2005-11-30 at 20:54, Troy Benjegerdes wrote:
> On Wed, Nov 30, 2005 at 07:33:44PM -0600, Troy Benjegerdes wrote:
> > A couple of days ago I started up two instances of opensm on my network,
> > and set one with priority 11, the other with the default 10.
> > 
> > I could kill one and the other would become master a few minutes later.
> > 
> > Well, today, I found that there are no active links anywhere in the
> > network.. But both SM's still appeared to be running.

OK. It seems weird that the master would respond to polls from the
standby but not bring links to active. Do you have a log of the master?
That might be more informative.

> > then I killed them both, and restarted one with 'opensm -V -p 11', 
> > 
> > it is still staying in STANDBY state, and produced the 4MB log available at
> >
> > http://scl.ameslab.gov/~troy/osm.log-nomaster

There is another master out there:
Nov 30 19:24:54 421835 [41802960] -> SMInfo dump:
				guid....................0x0002c90108cd8ba1
				sm_key..................0x0000000000000000
				act_count...............7389948
				priority................0
				sm_state................3

which shows the other SM (state 3 = master) at priority 0. What node is
that? What's weird is that the higher-priority SM does not appear to
take over from it. Not sure why right now.
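
For reference, the takeover rule I would expect the standby to apply is
roughly the following (just a sketch based on my reading of the spec,
with illustrative names rather than OpenSM's actual code): higher
priority wins, and I believe a tie is broken by the numerically smaller
GUID. By that rule a priority 11 SM should preempt a priority 0 master.

#include <stdint.h>

/* Sketch only -- illustrative, not OpenSM's actual code.
 * SMInfo sm_state 3 = master (as in the dump above); priority is
 * numeric with higher values winning; a tie should go to the
 * numerically smaller GUID, as far as I recall. */
struct sm_info {
	uint64_t guid;
	uint8_t  priority;
	uint8_t  sm_state;
};

/* Should the local SM take over mastership from the remote SM? */
static int should_take_over(const struct sm_info *local,
			    const struct sm_info *remote)
{
	if (local->priority != remote->priority)
		return local->priority > remote->priority;
	return local->guid < remote->guid;
}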


Another weird thing is the following at the start of the log:

Nov 30 19:22:41 420334 [AB44FCE0] -> osm_db_restore: Got key:0x0002c90108cd85f1 value:0x000c 0x000c
Nov 30 19:22:41 420342 [AB44FCE0] -> osm_db_restore: Got key:0x00066a00a0000458 value:0x0007 0x0007
Nov 30 19:22:41 420349 [AB44FCE0] -> osm_db_restore: Got key:0x0002550000039e00 value:0x0003 0x0003
Nov 30 19:22:41 420355 [AB44FCE0] -> osm_db_restore: Got key:0x00066a00a0000444 value:0x0006 0x0006
Nov 30 19:22:41 420362 [AB44FCE0] -> osm_db_restore: Got key:0x00066a00a000043c value:0x0004 0x0004
Nov 30 19:22:41 420369 [AB44FCE0] -> osm_db_restore: Got key:0x67609ef000040000 value:0x0017 0x0017
Nov 30 19:22:41 420375 [AB44FCE0] -> osm_db_restore: Got key:0x0002c90200402915 value:0x0002 0x0002
Nov 30 19:22:41 420382 [AB44FCE0] -> osm_db_restore: Got key:0x0002c90108cd9bd1 value:0x000b 0x000b
Nov 30 19:22:41 420388 [AB44FCE0] -> osm_db_restore: Got key:0x0002c90108cd98c1 value:0x000a 0x000a
Nov 30 19:22:41 420395 [AB44FCE0] -> osm_db_restore: Got key:0x0002c90108cd84a1 value:0x000d 0x000d
Nov 30 19:22:41 420402 [AB44FCE0] -> osm_db_restore: Got key:0x6760a0f000040080 value:0x0021 0x0021
Nov 30 19:22:41 420408 [AB44FCE0] -> osm_db_restore: Got key:0x0002c90200007b3d value:0x0014 0x0014
Nov 30 19:22:41 420415 [AB44FCE0] -> osm_db_restore: Got key:0x0002c9020040272d value:0x001d 0x001d
Nov 30 19:22:41 420421 [AB44FCE0] -> osm_db_restore: Got key:0x0002c90200402917 value:0x000e 0x000e
Nov 30 19:22:41 420428 [AB44FCE0] -> osm_db_restore: Got key:0x0002c90200402781 value:0x0001 0x0001
Nov 30 19:22:41 420435 [AB44FCE0] -> osm_db_restore: Got key:0x0002c90200402782 value:0x0009 0x0009
Nov 30 19:22:41 420445 [AB44FCE0] -> osm_db_restore: Got key:0x6760cef000040080 value:0x000f 0x000f
Nov 30 19:22:41 420452 [AB44FCE0] -> osm_db_restore: Got key:0x0002c9020040272e value:0x001e 0x001e
Nov 30 19:22:41 420459 [AB44FCE0] -> osm_db_restore: Got key:0x6760cef000040000 value:0x0012 0x0012
Nov 30 19:22:41 420465 [AB44FCE0] -> osm_db_restore: Got key:0x6760a0f000040000 value:0x0022 0x0022
Nov 30 19:22:41 420472 [AB44FCE0] -> osm_db_restore: Got key:0x0002c90108cd0b71 value:0x0010 0x0010
Nov 30 19:22:41 420479 [AB44FCE0] -> osm_db_restore: Got key:0x00066a00a000044e value:0x0011 0x0011
Nov 30 19:22:41 420485 [AB44FCE0] -> osm_db_restore: Got key:0x0002550000039e80 value:0x0013 0x0013
Nov 30 19:22:41 420492 [AB44FCE0] -> osm_db_restore: Got key:0x67609ef000040080 value:0x0015 0x0015
Nov 30 19:22:41 420498 [AB44FCE0] -> osm_db_restore: Got key:0x00066a00a0000441 value:0x0005 0x0005
Nov 30 19:22:41 420505 [AB44FCE0] -> osm_db_restore: Got key:0x0002c90200007bd5 value:0x0016 0x0016
Nov 30 19:22:41 420512 [AB44FCE0] -> osm_db_restore: Got key:0x0002c90108ccc571 value:0x0008 0x0008

There are a number of funky OUIs here (the ones starting with 0x6760xx).
For example, 0x6760a0 and 0x67609e do not appear to be valid OUIs. Any
idea what equipment that is? (Perhaps there is some endian problem
with these.)
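
If it is an endian problem, one quick check is to byte swap the suspect
GUIDs and look at the top 3 bytes (the OUI) both ways. Something along
these lines (a throwaway sketch, with the GUIDs taken from the dump
above) would show whether the swapped form looks any more plausible:

#include <stdint.h>
#include <stdio.h>

/* Byte-reverse a 64-bit GUID. */
static uint64_t bswap64(uint64_t x)
{
	uint64_t r = 0;
	int i;

	for (i = 0; i < 8; i++)
		r = (r << 8) | ((x >> (8 * i)) & 0xff);
	return r;
}

int main(void)
{
	/* Two of the suspect GUIDs from the guid2lid dump above. */
	uint64_t guids[] = { 0x67609ef000040000ULL, 0x6760a0f000040080ULL };
	unsigned i;

	for (i = 0; i < sizeof(guids) / sizeof(guids[0]); i++) {
		uint64_t g = guids[i], s = bswap64(g);

		printf("guid 0x%016llx (oui 0x%06llx)  swapped 0x%016llx (oui 0x%06llx)\n",
		       (unsigned long long)g, (unsigned long long)(g >> 40),
		       (unsigned long long)s, (unsigned long long)(s >> 40));
	}
	return 0;
}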

> > (Hal, if you want access to this system, let me know)

I may.

> And the rest of the story..
> 
> This happened after I cross-connected two networks, and had opensm
> running on two nodes that had back-to-back connections (with no switch).
> I didn't think anything of it at the time since the 'active' lights were
> off on the cards machines that were connected (they had physical link,
> but no logical link).

I'm not sure I have a good picture of your network topology. You
cross-connected two subnets, but there were no switches between the two
SMs. I'm missing something here.

> I've since killed opensm on the 'new' nodes, but there is still some
> state somewhere that prevents opensm from 'nicely' becoming the master..

There appears to be another master out there but with priority 0.

> If I run with 'opensm -d 0 -p 11', it becomes master just fine. How does
> one go about tracking down a broken rogue SM that isn't bringing up the
> network?

Can you find the following GUID, 0x0002c90108cd8ba1? I would pull that
one off the network and see. I don't see it in the original database.
Can you see it with ibnetdiscover?

Here is its NodeInfo from the standby log:
Nov 30 19:22:44 238782 [41001960] -> NodeInfo dump:
				base_version............0x1
				class_version...........0x1
				node_type...............Channel Adapter
				num_ports...............0x2
				sys_guid................0x0002c9000100d050
				node_guid...............0x0002c90108cd8ba0
				port_guid...............0x0002c90108cd8ba1
				partition_cap...........0x40
				device_id...............0x5A44
				revision................0xA1
				port_num................0x1
				vendor_id...............0x2C9

and its PortInfo shows:
Nov 30 19:22:44 301946 [41001960] -> Capabilities Mask:
				IB_PORT_CAP_IS_SM
				IB_PORT_CAP_HAS_TRAP
				IB_PORT_CAP_HAS_AUTO_MIG
				IB_PORT_CAP_HAS_SL_MAP
				IB_PORT_CAP_HAS_LED_INFO
				IB_PORT_CAP_HAS_SYS_IMG_GUID
				IB_PORT_CAP_HAS_VEND_CLS
				IB_PORT_CAP_HAS_CAP_NTC
so it is running an SM, and the same PortInfo shows this node has a LID of 4:
				port number.............0x1
				node_guid...............0x0002c90108cd8ba0
				port_guid...............0x0002c90108cd8ba1
				m_key...................0x0000000000000000
				subnet_prefix...........0xfe80000000000000
				base_lid................0x4
				master_sm_base_lid......0x4

That LID does conflict with one from the database:
Nov 30 19:22:41 420362 [AB44FCE0] -> osm_db_restore: Got key:0x00066a00a000043c value:0x0004 0x0004

Subsequent to this, the standby reports:
Nov 30 19:22:44 301995 [41001960] -> __osm_pi_rcv_process_endport: Detected another SM.  Requesting SMInfo.
				Port 0x2c90108cd8ba1.

I think there is some issue with conflict resolution of duplicate LIDs
when subnets are merged.
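
The guid2lid entries restored above are essentially a GUID-to-LID map,
so after a merge two different port GUIDs can end up claiming the same
LID, as with LID 0x4 here. Roughly the kind of check I mean (a sketch
with illustrative structures, not OpenSM's actual guid2lid code):

#include <stdint.h>
#include <stdio.h>

struct guid2lid_entry {
	uint64_t guid;
	uint16_t lid;
};

/* Report every pair of entries claiming the same LID.  O(n^2), which
 * is fine for a sketch. */
static void report_lid_conflicts(const struct guid2lid_entry *db, int n)
{
	int i, j;

	for (i = 0; i < n; i++)
		for (j = i + 1; j < n; j++)
			if (db[i].lid == db[j].lid)
				printf("LID 0x%04x claimed by 0x%016llx and 0x%016llx\n",
				       (unsigned)db[i].lid,
				       (unsigned long long)db[i].guid,
				       (unsigned long long)db[j].guid);
}

int main(void)
{
	/* The database entry and the rogue SM's port, both at LID 0x4
	 * (taken from the logs above). */
	struct guid2lid_entry db[] = {
		{ 0x00066a00a000043cULL, 0x0004 },
		{ 0x0002c90108cd8ba1ULL, 0x0004 },
	};

	report_lid_conflicts(db, 2);
	return 0;
}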

-- Hal




