[ewg] OpenSM from ofed-1.2 and ofed-1.3 clients
Hal Rosenstock
hrosenstock at xsigo.com
Wed Jun 4 13:40:18 PDT 2008
Hi Jim,
On Wed, 2008-06-04 at 13:59 -0600, Jim Schutt wrote:
> Hi Hal,
>
> I've just discovered that what I thought was ofed-1.2 opensm
> is really ofed-1.2-rc2, if it matters.
I don't recall what the differences were but let's assume it's not
significant for now.
> Anyway:
>
> On Wed, 2008-06-04 at 13:27 -0600, Hal Rosenstock wrote:
> > Jim,
> >
> > On Wed, 2008-06-04 at 13:14 -0600, Jim Schutt wrote:
> > > On Tue, 2008-06-03 at 11:13 -0600, Hal Rosenstock wrote:
> > > > Steve,
> > > >
> > > > One more thought below...
> > > >
> > > > On Tue, 2008-06-03 at 09:49 -0700, Hal Rosenstock wrote:
> > > > > Steve,
> > > > >
> > > > > On Tue, 2008-06-03 at 11:19 -0500, Steve Wise wrote:
> > > > > > Hello opensm gurus:
> > > > > >
> > > > > > Sandia is seeing problems after migrating up to ofed-1.3. They are
> > > > > > still using an ofed-1.2 opensm but with ofed-1.3 clients, updated from
> > > > > > ofed-1.2.5.
> > > > >
> > > > > Was the OpenSM node changed in some way or only the end nodes ?
> > >
> > > Only the end nodes were changed to ofed-1.3.
> > >
> > > > >
> > > > > > They are getting the errors below.
> > > > > >
> > > > > > > Q: should this work? Or are there backwards compat issues?
> > > > >
> > > > > > I haven't explicitly tried it but I would think it should work.
> > > > >
> > > > > The errors below are timeouts on switch MFT sets which are only
> > > > > indirectly related to the end nodes (in that the MC SA joins cause the
> > > > > MC routing and those tables to be set) so I don't see the relationship
> > > > > but might be missing something.
> > > > >
> > > > > -- Hal
> > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Steve.
> > > > > >
> > >
> > >
> > > > Could this switch SMA be "stuck" ?
> > > >
> > > > Could you try smpquery -D nodeinfo 0,1,14,9
> > > > and
> > > > smpquery -D nodeinfo 0,1,14
> > > > from the SM node ?
> > > >
> > > > -- Hal
> > >
> > >
> > > So I've stopped the 1.3 opensm (which was not logging anything)
> >
> > Perhaps this OpenSM was not master.
> >
> > > and started up the 1.2 opensm on another node. After a short time
> > > > (<1 min) it started logging these again:
> >
> > That's likely when it becomes master again.
> >
> > > ------
> > >
> > > Jun 04 11:51:13 063963 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x2 attr=0x1B trans_id=0x23000406da) -- dropping
> > > Jun 04 11:51:13 063973 [45007960] -> umad_receiver: ERR 5411: DR SMP Hop Ptr: 0x0
> > > Jun 04 11:51:13 063986 [45007960] -> Received SMP on a 3 hop path:
> > > Initial path = 0,0,0,0
> > > Return path = 0,0,0,0
> > > Jun 04 11:51:13 063996 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT)
> > > Jun 04 11:51:13 064004 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3119: Set method failed
> > > Jun 04 11:51:13 064034 [45007960] -> SMP dump:
> > > base_ver................0x1
> > > mgmt_class..............0x81
> > > class_ver...............0x1
> > > method..................0x2 (SubnSet)
> > > D bit...................0x0
> > > status..................0x0
> > > hop_ptr.................0x0
> > > hop_count...............0x3
> > > trans_id................0x406da
> > > attr_id.................0x1B (MulticastForwardingTable)
> > > resv....................0x0
> > > attr_mod................0x10000000
> > > m_key...................0x0000000000000000
> > > dr_slid.................0xFFFF
> > > dr_dlid.................0xFFFF
> > >
> > > Initial path: 0,1,13,8
> > > Return path: 0,0,0,0
> > > Reserved: [0][0][0][0][0][0][0]
> > >
> > > 00 40 00 40 00 00 00 00 00 00 00 00 00 00 00 00
> > >
> > > 00 00 00 00 00 00 00 00 00 00 00 00 00 40 00 00
> > >
> > > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > >
> > > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > >
> > > Jun 04 11:51:13 067533 [42803960] -> Errors during initialization
> > > Jun 04 11:51:13 067607 [42803960] -> __osm_state_mgr_init_errors_msg:
> > >
> > > ------
> > >
> > > I tried your smpquery suggestion from the node running
> > > the 1.2 opensm:
> > >
> > > # smpquery -D nodeinfo 0,1,13,8
> > > # Node info: DR path [0][1][13][8]
> > > BaseVers:........................1
> > > ClassVers:.......................1
> > > NodeType:........................Switch
> > > NumPorts:........................24
> > > SystemGuid:......................0x00066a0803000107
> > > Guid:............................0x00066a00010001e8
> > > PortGuid:........................0x00066a00010001e8
> > > PartCap:.........................8
> > > DevId:...........................0xb924
> > > Revision:........................0x000000a1
> > > LocalPort:.......................24
> > > VendorId:........................0x00066a
> > > # smpquery -D nodeinfo 0,1,13
> > > # Node info: DR path [0][1][13]
> > > BaseVers:........................1
> > > ClassVers:.......................1
> > > NodeType:........................Switch
> > > NumPorts:........................24
> > > SystemGuid:......................0x00066a0867000107
> > > Guid:............................0x00066a0004000133
> > > PortGuid:........................0x00066a0004000133
> > > PartCap:.........................8
> > > DevId:...........................0xb924
> > > Revision:........................0x000000a1
> > > LocalPort:.......................17
> > > VendorId:........................0x00066a
> > > # smpquery -D nodeinfo 0,1
> > > # Node info: DR path [0][1]
> > > BaseVers:........................1
> > > ClassVers:.......................1
> > > NodeType:........................Switch
> > > NumPorts:........................24
> > > SystemGuid:......................0x00066a0808000107
> > > Guid:............................0x00066a000100015e
> > > PortGuid:........................0x00066a000100015e
> > > PartCap:.........................8
> > > DevId:...........................0xb924
> > > Revision:........................0x000000a1
> > > LocalPort:.......................7
> > > VendorId:........................0x00066a
> >
> >
> > So the switch SMAs are fine.
> >
> > Can you do:
> > saquery -s
> > as it seems there could be more SMs in your subnet.
>
> On the node running the 1.3 opensm:
>
> # saquery -s
> IsSM ports
> PortInfoRecord dump:
> EndPortLid..............0x1
> PortNum.................0x1
> base_lid................0x1
> master_sm_base_lid......0x1
> capability_mask.........0x2510A6A
>
> IsSMdisabled ports
>
> # service opensmd stop
> Stopping IB Subnet Manager..-. [ OK ]
>
> # saquery -s
> Query SA failed: IB_TIMEOUT
>
> Then, on the node which runs the 1.2-rc2 opensm:
>
> # saquery -s
> Query SA failed: IB_TIMEOUT
>
> # service opensmd start
> Starting IB Subnet Manager [ OK ]
>
> # saquery -s
> IsSM ports
> PortInfoRecord dump:
> EndPortLid..............0x88
> PortNum.................0x1
> base_lid................0x88
> master_sm_base_lid......0x88
> capability_mask.........0x2510A6A
>
> IsSMdisabled ports
>
> Then, back on the node where I would run the 1.3 opensm:
>
> # saquery -s
> IsSM ports
> PortInfoRecord dump:
> EndPortLid..............0x88
> PortNum.................0x1
> base_lid................0x88
> master_sm_base_lid......0x88
> capability_mask.........0x2510A6A
>
> IsSMdisabled ports
OK; this clearly shows there's only 1 SM at a time here.
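
As a rough illustration of that check, here is a minimal Python sketch that
parses `saquery -s` output in the layout pasted above and reports which LIDs
are advertised as master. The function names (`parse_sm_ports`, `is_master`)
are hypothetical helpers, not part of any OFED tool, and the parser assumes
the exact "name....0xvalue" field layout shown in this thread.

```python
import re

# Sample taken from the saquery -s output quoted in this thread.
SAMPLE = """\
IsSM ports
PortInfoRecord dump:
EndPortLid..............0x88
PortNum.................0x1
base_lid................0x88
master_sm_base_lid......0x88
capability_mask.........0x2510A6A

IsSMdisabled ports
"""

FIELD = re.compile(r"(\w+)\.+(0x[0-9A-Fa-f]+)$")


def parse_sm_ports(saquery_output):
    """Return one dict per PortInfoRecord listed under 'IsSM ports'."""
    records = []
    current = None
    in_issm = False
    for raw in saquery_output.splitlines():
        line = raw.strip()
        if line == "IsSM ports":
            in_issm = True
        elif line == "IsSMdisabled ports":
            in_issm = False
            current = None
        elif in_issm and line.startswith("PortInfoRecord dump"):
            current = {}
            records.append(current)
        elif in_issm and current is not None:
            match = FIELD.match(line)
            if match:
                current[match.group(1)] = int(match.group(2), 16)
    return records


def is_master(record):
    """Assumption: a port hosts the master SM when its own base LID
    equals the master SM LID it reports (the SMLID comparison)."""
    return record.get("base_lid") == record.get("master_sm_base_lid")


if __name__ == "__main__":
    ports = parse_sm_ports(SAMPLE)
    print("IsSM ports found:", len(ports))
    for port in ports:
        print(hex(port["base_lid"]), "master" if is_master(port) else "not master")
```

Feeding it the output captured on each node in turn would show whether more
than one IsSM port is ever listed, and whether the listed port's base_lid
matches master_sm_base_lid.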

Did this exact cluster work fine (with OFED 1.2 (rc2) OpenSM) when the
end nodes were OFED 1.2 rather than 1.3 and that was the only change ?

Did the cluster size change too by any chance ? How large a cluster is
this ? (There were some fixes here which should help for OFED 1.3).

What opensm command line and config file options are being used to start
the OpenSMs ?

Are you trying to stick with the OFED 1.2 (rc2) OpenSM or would the OFED
1.3 OpenSM be OK if it worked in your environment ?

Sorry for all the questions but I'm trying to come up with a theory on
what's not right.

-- Hal
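
To help narrow down a theory from a large log, the following Python sketch
tallies the umad_receiver ERR 5409 lines by method and attribute, and counts
the directed-route paths that appear in the SMP dumps. It assumes the log
lines have exactly the format quoted in this thread. The function names are
hypothetical, and the two name tables contain only the codes that the SMP
dump above decodes (0x2 as SubnSet, 0x1B as MulticastForwardingTable); any
other code is left in hex.

```python
import re
from collections import Counter

ERR_5409 = re.compile(
    r"ERR 5409: send completed with error "
    r"\(method=(0x[0-9A-Fa-f]+) attr=(0x[0-9A-Fa-f]+) trans_id=(0x[0-9A-Fa-f]+)\)"
)
# The SMP dump prints "Initial path: a,b,c"; the "Received SMP" block uses
# "Initial path = ..." and is deliberately not matched here.
DUMP_PATH = re.compile(r"Initial path: ([0-9]+(?:,[0-9]+)*)")

# Names copied from the SMP dump in this thread; not a complete table.
METHOD_NAMES = {0x2: "SubnSet"}
ATTR_NAMES = {0x1B: "MulticastForwardingTable"}


def summarize_send_errors(log_text):
    """Count ERR 5409 lines per (method, attribute) pair."""
    counts = Counter()
    for match in ERR_5409.finditer(log_text):
        method = int(match.group(1), 16)
        attr = int(match.group(2), 16)
        key = (
            METHOD_NAMES.get(method, hex(method)),
            ATTR_NAMES.get(attr, hex(attr)),
        )
        counts[key] += 1
    return counts


def summarize_paths(log_text):
    """Count the directed-route paths listed in SMP dumps."""
    return Counter(DUMP_PATH.findall(log_text))


if __name__ == "__main__":
    sample = (
        "Jun 04 11:51:13 063963 [45007960] -> umad_receiver: ERR 5409: "
        "send completed with error (method=0x2 attr=0x1B "
        "trans_id=0x23000406da) -- dropping\n"
        "Initial path = 0,0,0,0\n"
        "Initial path: 0,1,13,8\n"
    )
    print(summarize_send_errors(sample))
    print(summarize_paths(sample))
```

If every failing send turns out to target the same path, that points at one
switch; if the paths are spread across the fabric, the cause is more likely
on the SM side.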
> -- Jim
>
> >
> > > I logged 43 MB of the above errors in 11 minutes.
> > > While opensm was logging those errors, I tried pinging
> > > another node in the fabric via IPoIB, and it was
> > > reachable.
> > >
> > > I stopped the 1.2 opensm and restarted the 1.3 opensm.
> > > It logged nothing after the initial startup messages.
> >
> > Likely it is not master. One way to check this is via SMLID comparison.
> >
> > -- Hal
> >
> > > -- Jim
> > >
> > >
> > >
> >
> >
>
>