[openib-general] opensm and faulty hardware

Hal Rosenstock halr at voltaire.com
Tue Sep 27 12:11:05 PDT 2005


On Tue, 2005-09-27 at 14:13, Viswanath Krishnamurthy wrote:
> I tracked down the issue to a bug in osm_lid_mgr.c 
> 
> function:  __osm_lid_mgr_init_sweep(...)
> 
> The bad hardware was returning an assigned LID of 0xFFFF. In this
> function there is a loop, shown below with line numbers, where opensm
> gets stuck:
> 
>     392   p_port_guid_tbl = &p_mgr->p_subn->port_guid_tbl;
>     393
>     394   for( p_port = (osm_port_t*)cl_qmap_head( p_port_guid_tbl );
>     395        p_port != (osm_port_t*)cl_qmap_end( p_port_guid_tbl );
>     396        p_port = (osm_port_t*)cl_qmap_next( &p_port->map_item ) )
>     397   {
>     398     osm_port_get_lid_range_ho(p_port, &disc_min_lid, &disc_max_lid);
>     399     for (lid = disc_min_lid; lid <= disc_max_lid; lid++)   <===== Bug here
>     400       cl_ptr_vector_set(p_discovered_vec, lid, p_port );
>     401   }
> 
> Since disc_max_lid and disc_min_lid are both 0xFFFF, and these are
> unsigned 16-bit numbers, incrementing lid past 0xFFFF wraps it back
> to 0, so the condition in the for loop never becomes false and opensm
> is stuck in the loop. There are a couple of other places in that
> function that need fixing too.
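
For illustration only, here is a minimal sketch of one way to make such
a loop terminate when the upper bound is 0xFFFF: use a counter wider
than 16 bits so it can reach 0x10000 and fail the comparison. The
function name mark_lid_range, the seen[] array and LID_SPACE are
hypothetical stand-ins, not OpenSM code, and this is not necessarily
the fix that went into osm_lid_mgr.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LID_SPACE 0x10000u  /* number of distinct 16-bit LID values */

/*
 * Hypothetical stand-in for the loop at lines 399-400: marks every LID
 * in [min_lid, max_lid] in seen[] (which must hold LID_SPACE entries)
 * and returns how many entries were marked.
 */
size_t mark_lid_range(uint8_t *seen, uint16_t min_lid, uint16_t max_lid)
{
    size_t marked = 0;
    uint32_t lid;  /* 32 bits wide: can hold 0x10000, so the loop ends */

    assert(seen != NULL);

    /* With a uint16_t counter and max_lid == 0xFFFF, lid++ would wrap
     * to 0 and "lid <= max_lid" would stay true forever. */
    for (lid = min_lid; lid <= max_lid; lid++) {
        seen[lid] = 1;
        marked++;
    }
    return marked;
}
```

With min_lid and max_lid both 0xFFFF this marks exactly one entry and
returns. Widening the counter only removes the hang; it does not make
0xFFFF an acceptable LID for a port.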

Sep 26 15:26:03 424135 [B66CFBB0] -> SMP dump:
                                base_ver................0x1
                                mgmt_class..............0x81
                                class_ver...............0x1
                                method..................0x1 (SubnGet)
                                D bit...................0x0
                                status..................0x0
                                hop_ptr.................0x0
                                hop_count...............0x2
                                trans_id................0x1274
                                attr_id.................0x15 (PortInfo)
                                resv....................0x0
                                attr_mod................0x1
                                m_key...................0x0000000000000000
                                dr_slid.................0xFFFF
                                dr_dlid.................0xFFFF


Sep 26 15:26:03 424407 [B6ED0BB0] -> __osm_nd_rcv_process_nd: Node 0x30d300002c7234
                                Description = Agilent E2954A 4x Generator for InfiniBand.
Sep 26 15:26:03 424426 [B6ED0BB0] -> __osm_nd_rcv_process_nd: ]

Sep 26 15:26:03 679882 [B56CDBB0] -> SMP dump:
                                base_ver................0x1
                                mgmt_class..............0x81
                                class_ver...............0x1
                                method..................0x81 (SubnGetResp)
                                D bit...................0x1
                                status..................0x0
                                hop_ptr.................0x0
                                hop_count...............0x2
                                trans_id................0x1274
                                attr_id.................0x15 (PortInfo)
                                resv....................0x0
                                attr_mod................0x1
                                m_key...................0x0000000000000000
                                dr_slid.................0xFFFF
                                dr_dlid.................0xFFFF

                                Initial path: [0][1][12]
                                Return path:  [0][E][0]


Sep 26 15:26:03 680291 [B76D1BB0] -> osm_pi_rcv_process: [
Sep 26 15:26:03 680323 [B56CDBB0] -> __osm_sm_mad_ctrl_rcv_callback: ]
Sep 26 15:26:03 680343 [B76D1BB0] -> PortInfo dump:
                                port number.............0x1
                                node_guid...............0x0030d300002c7234
                                port_guid...............0x0030d300002c7234
                                m_key...................0x0000000000000000
                                subnet_prefix...........0xfe80000000000000
                                base_lid................0xFFFF

Yes, it appears the Agilent exerciser returned good status to an SM Get
PortInfo with a base_lid of 0xffff. OpenSM should validate the base_lid
it receives.
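
As a rough sketch of what such validation could look like (not
OpenSM's actual code): the InfiniBand unicast LID range is 0x0001 to
0xBFFF, with 0x0000 reserved and 0xFFFF being the permissive LID, so a
reported base_lid can be checked against that range together with the
LMC. The function and macro names below are made up for the example,
and the check does not verify that base_lid is aligned to the LMC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Unicast LID bounds (host order); names are local to this sketch. */
#define LID_UCAST_START 0x0001u
#define LID_UCAST_END   0xBFFFu

/*
 * Returns true if the range [base_lid, base_lid + 2^lmc - 1] lies
 * entirely within the unicast LID space. A false result means the
 * reported LID should not be used as an index or loop bound.
 */
bool lid_range_is_unicast(uint16_t base_lid, uint8_t lmc)
{
    uint32_t max_lid;

    if (lmc > 7)  /* LMC is a 3-bit field */
        return false;
    if (base_lid < LID_UCAST_START || base_lid > LID_UCAST_END)
        return false;

    /* Compute the upper bound in 32 bits so it cannot wrap. */
    max_lid = (uint32_t)base_lid + (1u << lmc) - 1u;
    return max_lid <= LID_UCAST_END;
}

/* Usage: the value reported by the exerciser is rejected. */
void lid_range_examples(void)
{
    assert(lid_range_is_unicast(0x0001, 0));
    assert(!lid_range_is_unicast(0xFFFF, 0));  /* permissive LID */
    assert(!lid_range_is_unicast(0x0000, 0));  /* reserved / unassigned */
}
```

Note that a base_lid of 0 is normal for a port that has not yet been
assigned a LID, so a caller would presumably treat that case as
"unassigned" rather than as a hardware fault.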

-- Hal



