[openib-general] Re: opensm: new segv on shutdown

Hal Rosenstock halr at voltaire.com
Thu Jun 2 11:08:48 PDT 2005


On Thu, 2005-06-02 at 13:31, Tom Duffy wrote:
> On Wed, 2005-06-01 at 20:45 -0400, Hal Rosenstock wrote:
> > On Wed, 2005-06-01 at 16:51, Tom Duffy wrote: 
> > > I am putting together a network with a dumb IB switch, a couple of Linux
> > > OpenIB boxes, a Solaris 10 box, a Solaris Nevada box, etc.  I fired up
> > > opensm on one of the Linux nodes, tried to plumb Solaris, no luck.  I
> > > then hit control-c on opensm and it crashed.  Here are the messages and
> > > then the crash.
> > 
> > Anything from the Solaris side on what it doesn't like about the OpenIB
> > RMPP ?
> 
> I am not seeing any errors coming from Solaris; I will have to enable
> debug and try again.  Clearly OpenSM is able to find the Solaris nodes
> (there are 3 Solaris, 2 Linux):
> 
> [root@flopteron2 ~]# ibhosts
> Hca     : 0x0002c90109766e40 ports 2 "MT23108 InfiniHost Mellanox Technologies"
> Hca     : 0x0002c90109765630 ports 2 "MT23108 InfiniHost Mellanox Technologies"
> Hca     : 0x0002c901097624c0 ports 2 "MT23108 InfiniHost Mellanox Technologies"
> Hca     : 0x0002c90109765710 ports 2 "MT23108 InfiniHost Mellanox Technologies"
> Hca     : 0x0002c9010a99e030 ports 2 "MT25208 InfiniHostEx Mellanox Technologies"
> 
> [root@flopteron2 ~]# ibnetdiscover
> #
> # Topology file: generated on Thu Jun  2 10:20:58 2005
> # 
> switchguids=0x617000000000d
> Switch  8 "S-000617000000000d"          # Agilent and RedSwitch High Performance 8 Port 4x IBA Switch port 0 lid 2
> [8]     "H-0002c90109766e40"[2] [6]     "H-0002c90109766e40"[1]
> [5]     "H-0002c90109765630"[1]
> [4]     "H-0002c90109765710"[1]
> [3]     "H-0002c901097624c0"[1]
> [2]     "H-0002c90109765710"[2]
> [7]     "H-0002c9010a99e030"[1]
> 
> hcaguids=0x2c90109766e40
> Hca     2 "H-0002c90109766e40"          # MT23108 InfiniHost Mellanox Technologies
> [2]     "S-000617000000000d"[8]         # lid 0 lmc 0
> [1]     "S-000617000000000d"[6]         # lid 0 lmc 0
> 
> hcaguids=0x2c90109765630
> Hca     2 "H-0002c90109765630"          # MT23108 InfiniHost Mellanox Technologies
> [1]     "S-000617000000000d"[5]         # lid 4 lmc 0
> 
> hcaguids=0x2c901097624c0
> Hca     2 "H-0002c901097624c0"          # MT23108 InfiniHost Mellanox Technologies
> [1]     "S-000617000000000d"[3]         # lid 5 lmc 0
> 
> hcaguids=0x2c90109765710
> Hca     2 "H-0002c90109765710"          # MT23108 InfiniHost Mellanox Technologies
> [1]     "S-000617000000000d"[4]         # lid 18 lmc 0
> [2]     "S-000617000000000d"[2]         # lid 3 lmc 0
> 
> hcaguids=0x2c9010a99e030
> Hca     2 "H-0002c9010a99e030"          # MT25208 InfiniHostEx Mellanox Technologies
> [1]     "S-000617000000000d"[7]         # lid 1 lmc 0
> 
> ---
> 
> When I pop open ibsmgui (the graphical IB browser on Solaris), I get an
> error:
> 
> sa_access_retrieve failed
> sa_access_retrieve failed: -18
> 
> from the application.  This is the relevant opensm log generated by this
> query:
> 
> Jun 02 10:22:57 [44808960] -> __osm_sa_mad_ctrl_rcv_callback: [
> Jun 02 10:22:57 [44808960] -> __osm_sa_mad_ctrl_rcv_callback: 13802 QP1 MADs received.
> Jun 02 10:22:57 [44808960] -> SA MAD dump:
>                                 base_ver................0x1
>                                 mgmt_class..............0x3
>                                 class_ver...............0x2
>                                 method..................0x12 (SubnAdmGetTable)
>                                 status..................0x0
>                                 resv....................0x0
>                                 trans_id................0x976563100000010
>                                 attr_id.................0x31 (ServiceRecord)
>                                 resv1...................0x0
>                                 attr_mod................0xFFFFFFFF
>                                 rmpp_version............0x0
>                                 rmpp_type...............0x0
>                                 rmpp_flags..............0x0
>                                 rmpp_status.............0x0
>                                 seg_num.................0x0
>                                 payload_len/new_win.....0x0
>                                 sm_key..................0x0000000000000000
>                                 attr_offset.............0x0
>                                 resv2...................0x0
>                                 comp_mask...............0x0000000000000000
> 
> 
> Jun 02 10:22:57 [44808960] -> __osm_sa_mad_ctrl_process: [
> Jun 02 10:22:57 [44808960] -> __osm_sa_mad_ctrl_process: Posting Dispatcher message OSM_MSG_MAD_SERVICE_RECORD.
> Jun 02 10:22:57 [44808960] -> __osm_sa_mad_ctrl_process: ]

Something is not working correctly between the OpenIB RMPP implementation
and Solaris RMPP. That is the cause of this failure (and the root cause of
the opensm segv reported yesterday).
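
For anyone who wants to check the quoted SA MAD dump mechanically, here is
a minimal sketch in Python. It assumes the dump format shown in the log
above (field name, a run of dots, a hex value, an optional label in
parentheses). The names parse_mad_dump and FIELD_RE are hypothetical
helpers for illustration and are not part of opensm. It only confirms what
the log already shows: the incoming SubnAdmGetTable request for
ServiceRecord carries all-zero RMPP header fields.

```python
import re

# Excerpt of the "SA MAD dump" from the opensm log quoted above.
MAD_DUMP = """\
base_ver................0x1
mgmt_class..............0x3
class_ver...............0x2
method..................0x12 (SubnAdmGetTable)
status..................0x0
trans_id................0x976563100000010
attr_id.................0x31 (ServiceRecord)
attr_mod................0xFFFFFFFF
rmpp_version............0x0
rmpp_type...............0x0
rmpp_flags..............0x0
rmpp_status.............0x0
comp_mask...............0x0000000000000000
"""

# Optional "> " quote prefix, field name, two or more dots, hex value,
# and an optional "(label)" such as "(SubnAdmGetTable)".
FIELD_RE = re.compile(
    r"^(?:>\s*)?\s*([\w/]+)\.{2,}(0x[0-9A-Fa-f]+)(?:\s+\((\w+)\))?\s*$"
)


def parse_mad_dump(text):
    """Return {field_name: (integer_value, label_or_None)} for each dump line."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            name, value, label = match.groups()
            fields[name] = (int(value, 16), label)
    return fields


fields = parse_mad_dump(MAD_DUMP)

# Collect only the RMPP header fields of the request.
rmpp_fields = {k: v[0] for k, v in fields.items() if k.startswith("rmpp_")}

print("method:", fields["method"])
print("attr_id:", fields["attr_id"])
print("RMPP fields all zero:", all(v == 0 for v in rmpp_fields.values()))
```

This describes the request as logged only; it says nothing about how the
response is segmented or where the interoperability problem lies.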

-- Hal



