[ewg] OpenSM from ofed-1.2 and ofed-1.3 clients

Hal Rosenstock hrosenstock at xsigo.com
Wed Jun 4 12:27:23 PDT 2008


Jim,

On Wed, 2008-06-04 at 13:14 -0600, Jim Schutt wrote:
> On Tue, 2008-06-03 at 11:13 -0600, Hal Rosenstock wrote:
> > Steve,
> > 
> > One more thought below...
> > 
> > On Tue, 2008-06-03 at 09:49 -0700, Hal Rosenstock wrote:
> > > Steve,
> > >
> > > On Tue, 2008-06-03 at 11:19 -0500, Steve Wise wrote:
> > > > Hello opensm gurus:
> > > >
> > > > Sandia is seeing problems after migrating up to ofed-1.3.  They are
> > > > still using an ofed-1.2 opensm but with ofed-1.3 clients, updated from
> > > > ofed-1.2.5.
> > >
> > > Was the OpenSM node changed in some way or only the end nodes ?
> 
> Only the end nodes were changed to ofed-1.3.
> 
> > >
> > > > They are getting the errors below.
> > > >
> > > > Q: should this work?  Or are there backwards compat issues?
> > >
> > > I haven't explicitly tried it but I would think it should work.
> > >
> > > The errors below are timeouts on switch MFT sets which are only
> > > indirectly related to the end nodes (in that the MC SA joins cause the
> > > MC routing and those tables to be set) so I don't see the relationship
> > > but might be missing something.
> > >
> > > -- Hal
> > >
> > > > Thanks,
> > > >
> > > > Steve.
> > > >
> 
> 
> > Could this switch SMA be "stuck" ?
> > 
> > Could you try smpquery -D nodeinfo 0,1,14,9
> > and
> > smpquery -D nodeinfo 0,1,14
> > from the SM node ?
> > 
> > -- Hal
> 
> 
> So I've stopped the 1.3 opensm (which was not logging anything)

Perhaps this OpenSM was not master.

> and started up the 1.2 opensm on another node.  After a short time
> (<1 min) it started logging these again:

That's likely when it becomes master again.

> ------
> 
> Jun 04 11:51:13 063963 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x2 attr=0x1B trans_id=0x23000406da) -- dropping
> Jun 04 11:51:13 063973 [45007960] -> umad_receiver: ERR 5411: DR SMP Hop Ptr: 0x0
> Jun 04 11:51:13 063986 [45007960] -> Received SMP on a 3 hop path:
>                                 Initial path = 0,0,0,0
>                                 Return path  = 0,0,0,0
> Jun 04 11:51:13 063996 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT)
> Jun 04 11:51:13 064004 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3119: Set method failed
> Jun 04 11:51:13 064034 [45007960] -> SMP dump:
>                                 base_ver................0x1
>                                 mgmt_class..............0x81
>                                 class_ver...............0x1
>                                 method..................0x2 (SubnSet)
>                                 D bit...................0x0
>                                 status..................0x0
>                                 hop_ptr.................0x0
>                                 hop_count...............0x3
>                                 trans_id................0x406da
>                                 attr_id.................0x1B (MulticastForwardingTable)
>                                 resv....................0x0
>                                 attr_mod................0x10000000
>                                 m_key...................0x0000000000000000
>                                 dr_slid.................0xFFFF
>                                 dr_dlid.................0xFFFF
> 
>                                 Initial path: 0,1,13,8
>                                 Return path:  0,0,0,0
>                                 Reserved:     [0][0][0][0][0][0][0]
> 
>                                 00 40 00 40 00 00 00 00   00 00 00 00 00 00 00 00
> 
>                                 00 00 00 00 00 00 00 00   00 00 00 00 00 40 00 00
> 
>                                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
> 
>                                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
> 
> Jun 04 11:51:13 067533 [42803960] -> Errors during initialization
> Jun 04 11:51:13 067607 [42803960] -> __osm_state_mgr_init_errors_msg:
> 
> ------
> 
> I tried your smpquery suggestion from the node running
> the 1.2 opensm:
> 
> # smpquery -D nodeinfo 0,1,13,8
> # Node info: DR path [0][1][13][8]
> BaseVers:........................1
> ClassVers:.......................1
> NodeType:........................Switch
> NumPorts:........................24
> SystemGuid:......................0x00066a0803000107
> Guid:............................0x00066a00010001e8
> PortGuid:........................0x00066a00010001e8
> PartCap:.........................8
> DevId:...........................0xb924
> Revision:........................0x000000a1
> LocalPort:.......................24
> VendorId:........................0x00066a
> # smpquery -D nodeinfo 0,1,13
> # Node info: DR path [0][1][13]
> BaseVers:........................1
> ClassVers:.......................1
> NodeType:........................Switch
> NumPorts:........................24
> SystemGuid:......................0x00066a0867000107
> Guid:............................0x00066a0004000133
> PortGuid:........................0x00066a0004000133
> PartCap:.........................8
> DevId:...........................0xb924
> Revision:........................0x000000a1
> LocalPort:.......................17
> VendorId:........................0x00066a
> # smpquery -D nodeinfo 0,1
> # Node info: DR path [0][1]
> BaseVers:........................1
> ClassVers:.......................1
> NodeType:........................Switch
> NumPorts:........................24
> SystemGuid:......................0x00066a0808000107
> Guid:............................0x00066a000100015e
> PortGuid:........................0x00066a000100015e
> PartCap:.........................8
> DevId:...........................0xb924
> Revision:........................0x000000a1
> LocalPort:.......................7
> VendorId:........................0x00066a


So the switch SMAs are fine.
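
The reasoning is that a NodeInfo Get came back for every prefix of the failing directed route, so each SMA along that path is answering. A minimal sketch of that check in Python, assuming the smpquery output quoted above has been saved as text. The name `parse_nodeinfo` and the trimmed `SAMPLE` are illustrative only and are not part of infiniband-diags:

```python
import re

# Header line printed before each reply, as in the quoted output.
HEADER = re.compile(r"^# Node info: DR path ((?:\[\d+\])+)")
# Field lines look like "Name:.......value".
FIELD = re.compile(r"^(\w+):\.*(.*)$")

def parse_nodeinfo(text):
    """Return {dr_path: {field: value}} for each NodeInfo reply in text."""
    replies = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        m = HEADER.match(line)
        if m:
            # "[0][1][13]" becomes "0,1,13"
            path = ",".join(re.findall(r"\d+", m.group(1)))
            current = replies.setdefault(path, {})
            continue
        m = FIELD.match(line)
        if m and current is not None:
            current[m.group(1)] = m.group(2)
    return replies

# Trimmed copy of two of the replies quoted above.
SAMPLE = """\
# smpquery -D nodeinfo 0,1,13,8
# Node info: DR path [0][1][13][8]
NodeType:........................Switch
Guid:............................0x00066a00010001e8
LocalPort:.......................24
# smpquery -D nodeinfo 0,1,13
# Node info: DR path [0][1][13]
NodeType:........................Switch
Guid:............................0x00066a0004000133
LocalPort:.......................17
"""

replies = parse_nodeinfo(SAMPLE)
for path in ("0,1,13", "0,1,13,8"):
    status = "answered" if path in replies else "no reply captured"
    print(path, status)
```

A path that is missing from the result only means no reply was captured in the text, not necessarily that the SMA is stuck.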

Can you do:
saquery -s
as it seems there could be more SMs in your subnet.

> I logged 43 MB of the above errors in 11 minutes.
> While opensm was logging those errors, I tried pinging
> another node in the fabric via IPoIB, and it was 
> reachable.
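
With that much log volume it may help to tally which error codes and which directed routes recur, to see whether the timeouts all point at one switch. A rough Python sketch, assuming the log lines have the layout quoted above. The function name `summarize` is made up for illustration, and the sample is a trimmed copy of the quoted excerpt:

```python
import re
from collections import Counter

# OpenSM error codes appear as "ERR 5409:" in the quoted log.
ERR = re.compile(r"ERR ([0-9A-Fa-f]{4}):")
# The SMP dump prints the target route as "Initial path: 0,1,13,8".
# The "Initial path = ..." line with '=' is deliberately not matched.
PATH = re.compile(r"Initial path: ([\d,]+)")

def summarize(log_text):
    """Count error codes and dumped directed routes in an opensm log."""
    return Counter(ERR.findall(log_text)), Counter(PATH.findall(log_text))

SAMPLE = """\
Jun 04 11:51:13 063963 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x2 attr=0x1B trans_id=0x23000406da) -- dropping
Jun 04 11:51:13 063996 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT)
Jun 04 11:51:13 064004 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3119: Set method failed
                                Initial path = 0,0,0,0
                                Initial path: 0,1,13,8
"""

errors, paths = summarize(SAMPLE)
for code, count in errors.most_common():
    print("ERR", code, count)
for path, count in paths.most_common():
    print("path", path, count)
```

If nearly every dump carries the same route, that narrows the problem to one switch or the links leading to it.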
> 
> I stopped the 1.2 opensm and restarted the 1.3 opensm.
> It logged nothing after the initial startup messages.

Likely it is not master. One way to check this is via SMLID comparison.
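
The idea behind the comparison: on the node running the master SM, the local port's own LID should equal the SM LID recorded in its PortInfo. A hedged Python sketch, assuming the port info is available in the same dotted `Name:....value` layout as the smpquery output quoted earlier. The field names `Lid` and `SMLid`, the helper names, and the LID values in the samples are assumptions for illustration and should be checked against the real tool output:

```python
import re

# Field lines look like "Name:.......value".
FIELD = re.compile(r"^(\w+):\.*(.*)$")

def parse_fields(text):
    """Parse dotted "Name:....value" lines into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        m = FIELD.match(line.strip())
        if m:
            fields[m.group(1)] = m.group(2)
    return fields

def to_int(value):
    """Accept decimal ("5") or hex ("0x5") renderings of a LID."""
    value = value.strip()
    if value.lower().startswith("0x"):
        return int(value, 16)
    return int(value, 10)

def local_sm_is_master(portinfo_text):
    """True if the port's own LID equals the SM LID it has recorded."""
    fields = parse_fields(portinfo_text)
    return to_int(fields["Lid"]) == to_int(fields["SMLid"])

# Hypothetical values: this port has LID 5 but points at an SM on LID 1.
NOT_MASTER = """\
Lid:.............................5
SMLid:...........................1
"""

# Hypothetical values: port LID and SM LID agree.
MASTER = """\
Lid:.............................1
SMLid:...........................1
"""

print(local_sm_is_master(NOT_MASTER))
print(local_sm_is_master(MASTER))
```

A mismatch suggests another SM in the subnet holds mastership, which would also explain an opensm that logs nothing after startup.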

-- Hal

> -- Jim
> 
> 
> 



