[ewg] OpenSM from ofed-1.2 and ofed-1.3 clients

Hal Rosenstock hrosenstock at xsigo.com
Wed Jun 4 13:56:39 PDT 2008


Hi again Jim,

On Wed, 2008-06-04 at 13:40 -0700, Hal Rosenstock wrote:
> Hi Jim,
> 
> On Wed, 2008-06-04 at 13:59 -0600, Jim Schutt wrote:
> > Hi Hal,
> > 
> > I've just discovered that what I thought was ofed-1.2 opensm
> > is really ofed-1.2-rc2, if it matters.
> 
> I don't recall what the differences were but let's assume it's not
> significant for now.
> 
> > Anyway:
> > 
> > On Wed, 2008-06-04 at 13:27 -0600, Hal Rosenstock wrote:
> > > Jim,
> > > 
> > > On Wed, 2008-06-04 at 13:14 -0600, Jim Schutt wrote:
> > > > On Tue, 2008-06-03 at 11:13 -0600, Hal Rosenstock wrote:
> > > > > Steve,
> > > > >
> > > > > One more thought below...
> > > > >
> > > > > On Tue, 2008-06-03 at 09:49 -0700, Hal Rosenstock wrote:
> > > > > > Steve,
> > > > > >
> > > > > > On Tue, 2008-06-03 at 11:19 -0500, Steve Wise wrote:
> > > > > > > Hello opensm gurus:
> > > > > > >
> > > > > > > Sandia is seeing problems after migrating up to ofed-1.3.  They are
> > > > > > > still using an ofed-1.2 opensm but with ofed-1.3 clients, updated from
> > > > > > > ofed-1.2.5.
> > > > > >
> > > > > > Was the OpenSM node changed in some way or only the end nodes ?
> > > >
> > > > Only the end nodes were changed to ofed-1.3.
> > > >
> > > > > >
> > > > > > > They are getting the errors below.
> > > > > > >
> > > > > > > > Q: should this work?  Or are there backwards compat issues?
> > > > > >
> > > > > > > I haven't explicitly tried it but I would think it should work.
> > > > > >
> > > > > > The errors below are timeouts on switch MFT sets which are only
> > > > > > indirectly related to the end nodes (in that the MC SA joins cause the
> > > > > > MC routing and those tables to be set) so I don't see the relationship
> > > > > > but might be missing something.
> > > > > >
> > > > > > -- Hal
> > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Steve.
> > > > > > >
> > > >
> > > >
> > > > > Could this switch SMA be "stuck" ?
> > > > >
> > > > > Could you try smpquery -D nodeinfo 0,1,14,9
> > > > > and
> > > > > smpquery -D nodeinfo 0,1,14
> > > > > from the SM node ?
> > > > >
> > > > > -- Hal
> > > >
> > > >
> > > > So I've stopped the 1.3 opensm (which was not logging anything)
> > > 
> > > Perhaps this OpenSM was not master.
> > > 
> > > > and started up the 1.2 opensm on another node.  After a short time
> > > > (<1 min) it started logging these again:
> > > 
> > > That's likely when it becomes master again.
> > > 
> > > > ------
> > > >
> > > > Jun 04 11:51:13 063963 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x2 attr=0x1B trans_id=0x23000406da) -- dropping
> > > > Jun 04 11:51:13 063973 [45007960] -> umad_receiver: ERR 5411: DR SMP Hop Ptr: 0x0
> > > > Jun 04 11:51:13 063986 [45007960] -> Received SMP on a 3 hop path:
> > > >                                 Initial path = 0,0,0,0
> > > >                                 Return path  = 0,0,0,0
> > > > Jun 04 11:51:13 063996 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT)
> > > > Jun 04 11:51:13 064004 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3119: Set method failed
> > > > Jun 04 11:51:13 064034 [45007960] -> SMP dump:
> > > >                                 base_ver................0x1
> > > >                                 mgmt_class..............0x81
> > > >                                 class_ver...............0x1
> > > >                                 method..................0x2 (SubnSet)
> > > >                                 D bit...................0x0
> > > >                                 status..................0x0
> > > >                                 hop_ptr.................0x0
> > > >                                 hop_count...............0x3
> > > >                                 trans_id................0x406da
> > > >                                 attr_id.................0x1B (MulticastForwardingTable)
> > > >                                 resv....................0x0
> > > >                                 attr_mod................0x10000000
> > > >                                 m_key...................0x0000000000000000
> > > >                                 dr_slid.................0xFFFF
> > > >                                 dr_dlid.................0xFFFF
> > > >
> > > >                                 Initial path: 0,1,13,8
> > > >                                 Return path:  0,0,0,0
> > > >                                 Reserved:     [0][0][0][0][0][0][0]
> > > >
> > > >                                 00 40 00 40 00 00 00 00   00 00 00 00 00 00 00 00
> > > >
> > > >                                 00 00 00 00 00 00 00 00   00 00 00 00 00 40 00 00
> > > >
> > > >                                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
> > > >
> > > >                                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
> > > >
> > > > Jun 04 11:51:13 067533 [42803960] -> Errors during initialization
> > > > Jun 04 11:51:13 067607 [42803960] -> __osm_state_mgr_init_errors_msg:
> > > >
> > > > ------
> > > >
> > > > I tried your smpquery suggestion from the node running
> > > > the 1.2 opensm:
> > > >
> > > > # smpquery -D nodeinfo 0,1,13,8
> > > > # Node info: DR path [0][1][13][8]
> > > > BaseVers:........................1
> > > > ClassVers:.......................1
> > > > NodeType:........................Switch
> > > > NumPorts:........................24
> > > > SystemGuid:......................0x00066a0803000107
> > > > Guid:............................0x00066a00010001e8
> > > > PortGuid:........................0x00066a00010001e8
> > > > PartCap:.........................8
> > > > DevId:...........................0xb924
> > > > Revision:........................0x000000a1
> > > > LocalPort:.......................24
> > > > VendorId:........................0x00066a
> > > > # smpquery -D nodeinfo 0,1,13
> > > > # Node info: DR path [0][1][13]
> > > > BaseVers:........................1
> > > > ClassVers:.......................1
> > > > NodeType:........................Switch
> > > > NumPorts:........................24
> > > > SystemGuid:......................0x00066a0867000107
> > > > Guid:............................0x00066a0004000133
> > > > PortGuid:........................0x00066a0004000133
> > > > PartCap:.........................8
> > > > DevId:...........................0xb924
> > > > Revision:........................0x000000a1
> > > > LocalPort:.......................17
> > > > VendorId:........................0x00066a
> > > > # smpquery -D nodeinfo 0,1
> > > > # Node info: DR path [0][1]
> > > > BaseVers:........................1
> > > > ClassVers:.......................1
> > > > NodeType:........................Switch
> > > > NumPorts:........................24
> > > > SystemGuid:......................0x00066a0808000107
> > > > Guid:............................0x00066a000100015e
> > > > PortGuid:........................0x00066a000100015e
> > > > PartCap:.........................8
> > > > DevId:...........................0xb924
> > > > Revision:........................0x000000a1
> > > > LocalPort:.......................7
> > > > VendorId:........................0x00066a
> > > 
> > > 
> > > So the switch SMAs are fine.
> > > 
> > > Can you do:
> > > saquery -s
> > > as it seems there could be more SMs in your subnet.
> > 
> > On the node running the 1.3 opensm:
> > 
> > # saquery -s
> > IsSM ports
> > PortInfoRecord dump:
> >                 EndPortLid..............0x1
> >                 PortNum.................0x1
> >                 base_lid................0x1
> >                 master_sm_base_lid......0x1
> >                 capability_mask.........0x2510A6A
> > 
> > IsSMdisabled ports
> > 
> > # service opensmd stop
> > Stopping IB Subnet Manager..-.                             [  OK  ]
> > 
> > # saquery -s
> > Query SA failed: IB_TIMEOUT
> > 
> > Then, on the node which runs the 1.2-rc2 opensm:
> > 
> > # saquery  -s
> > Query SA failed: IB_TIMEOUT
> > 
> > # service opensmd start
> > Starting IB Subnet Manager                                 [  OK  ]
> > 
> > # saquery  -s
> > IsSM ports
> > PortInfoRecord dump:
> >                 EndPortLid..............0x88
> >                 PortNum.................0x1
> >                 base_lid................0x88
> >                 master_sm_base_lid......0x88
> >                 capability_mask.........0x2510A6A
> > 
> > IsSMdisabled ports
> > 
> > Then, back on the node where I would run the 1.3 opensm:
> > 
> > # saquery -s
> > IsSM ports
> > PortInfoRecord dump:
> >                 EndPortLid..............0x88
> >                 PortNum.................0x1
> >                 base_lid................0x88
> >                 master_sm_base_lid......0x88
> >                 capability_mask.........0x2510A6A
> > 
> > IsSMdisabled ports
> 
> OK; this clearly shows there's only 1 SM at a time here.
> 
> Did this exact cluster work fine (with OFED 1.2 (rc2) OpenSM) when the
> end nodes were OFED 1.2 rather than 1.3 and that was the only change ?
> 
> Did the cluster size change too by any chance ? How large a cluster is
> this ? (There were some fixes here which should help for OFED 1.3).

There is an environment variable which can help here. It is
OSM_UMAD_MAX_PENDING which defaults to 1000 but can be made larger.
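
As a minimal sketch of raising it, assuming opensm is then started by
hand from the same shell (the value 2000 is only an illustration, not a
tuned recommendation, and an init script may not pass the environment
through to the daemon):

```shell
# Illustrative value only; the default mentioned above is 1000.
export OSM_UMAD_MAX_PENDING=2000
# Show what a process started from this shell would inherit.
echo "OSM_UMAD_MAX_PENDING=${OSM_UMAD_MAX_PENDING}"
```

Start opensm from that same shell afterwards so it sees the variable.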

> What opensm command line and config file options are being used to start
> the OpenSMs ?

Of particular interest is the maxsmps value. If it is set to 0, that
means unlimited outstanding SMPs, which is bad. I know 15 works. There was
a thread with
Christopher Maestas <cdmaest at sandia.gov> entitled "running opensm 3.0.3
on 4000+ node system" back in early April.
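
A rough sketch of guarding against a zero value before starting the SM.
The helper name opensm_cmdline is hypothetical, and it assumes your
opensm accepts the -maxsmps option as described in its man page; the
command is only printed, not run:

```shell
# Hypothetical helper for illustration: prints an opensm command line
# with a finite maxsmps and refuses 0 (unlimited outstanding SMPs).
opensm_cmdline() {
    maxsmps="$1"
    if [ "$maxsmps" -le 0 ]; then
        echo "maxsmps must be a positive number (0 means unlimited)" >&2
        return 1
    fi
    # Printed rather than executed, so this sketch starts nothing.
    echo "opensm -maxsmps $maxsmps"
}

opensm_cmdline 15
```

Check the printed line against your own startup options before using it.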

-- Hal

> Are you trying to stick with the OFED 1.2 (rc2) OpenSM or would the OFED
> 1.3 OpenSM be OK if it worked in your environment ?
> 
> Sorry for all the questions but I'm trying to come up with a theory on
> what's not right.
> 
> -- Hal
> 
> > -- Jim
> > 
> > > 
> > > > I logged 43 MB of the above errors in 11 minutes.
> > > > While opensm was logging those errors, I tried pinging
> > > > another node in the fabric via IPoIB, and it was
> > > > reachable.
> > > >
> > > > I stopped the 1.2 opensm and restarted the 1.3 opensm.
> > > > It logged nothing after the initial startup messages.
> > > 
> > > Likely it is not master. One way to check this is via SMLID comparison.
> > > 
> > > -- Hal
> > > 
> > > > -- Jim
> > > >
> > > >
> > > >
> > > 
> > > 
> > 
> > 



