Hal,<br>
<br>
I have added a hack for now to work around the problem. A proper fix will be needed later.<br>
<br>
[root@ibstg1 opensm]# svn diff osm_port.h<br>
Index: osm_port.h<br>
===================================================================<br>
--- osm_port.h  (revision 3549)<br>
+++ osm_port.h  (working copy)<br>
@@ -1049,6 +1049,8 @@<br>
 {<br>
        CL_ASSERT( p_physp );<br>
        CL_ASSERT( osm_physp_is_valid( p_physp ) );<br>
+       if (p_physp->port_info.base_lid == 0xFFFF)<br>
+               return (0);<br>
        return( p_physp->port_info.base_lid );<br>
 }<br>
 /*<br>
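<br>
For the proper fix, one option is to widen the loop counter so that it can step past 0xFFFF and let the loop terminate. A minimal sketch, assuming a hypothetical helper visit_lid_range() that is not part of OpenSM and that only counts the LIDs it visits instead of calling cl_ptr_vector_set():<br>

```c
#include <stdint.h>

/* Hypothetical helper, not part of OpenSM: walks every LID in the
 * inclusive range [min_lid, max_lid] and returns how many were visited. */
static unsigned visit_lid_range(uint16_t min_lid, uint16_t max_lid)
{
    unsigned visited = 0;
    uint32_t lid;   /* wider than 16 bits, so lid++ past 0xFFFF cannot wrap to 0 */

    for (lid = min_lid; lid <= max_lid; lid++)
        visited++;  /* the real loop body would record the port for this LID */

    return visited;
}
```

With a 16-bit counter the same loop never exits when the upper bound is 0xFFFF, because lid++ wraps to 0 and lid <= 0xFFFF stays true.<br>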
<br><br><div><span class="gmail_quote">On 27 Sep 2005 15:11:05 -0400, <b class="gmail_sendername">Hal Rosenstock</b> <<a href="mailto:halr@voltaire.com">halr@voltaire.com</a>> wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
On Tue, 2005-09-27 at 14:13, Viswanath Krishnamurthy wrote:<br>> I tracked down the issue to a bug in osm_lid_mgr.c<br>><br>> function:  __osm_lid_mgr_init_sweep(...)<br>><br>> The bad hardware was returning an assigned LID of 0xFFFF. In this
<br>> function there is a loop<br>> as follows where opensm is getting stuck (with line numbers):<br>><br>>     392   p_port_guid_tbl = &p_mgr->p_subn->port_guid_tbl;<br>>     393<br>>     394   for( p_port = (osm_port_t*)cl_qmap_head( p_port_guid_tbl );
<br>>     395        p_port != (osm_port_t*)cl_qmap_end( p_port_guid_tbl );
<br>>     396        p_port = (osm_port_t*)cl_qmap_next( &p_port->map_item ) )
<br>>     397   {<br>>     398     osm_port_get_lid_range_ho(p_port, &disc_min_lid, &disc_max_lid);
<br>>     399     for (lid = disc_min_lid; lid <= disc_max_lid; lid++)   <===== Bug here
<br>>     400       cl_ptr_vector_set(p_discovered_vec, lid, p_port );<br>>     401   }<br>><br>> Since disc_max_lid and disc_min_lid are both 0xFFFF, and these are<br>> unsigned 16-bit numbers, the condition
<br>> in the for loop never becomes false, and opensm is stuck in the loop.<br>> There are a couple of other places in that<br>> function that need fixing too.<br><br>Sep 26 15:26:03 424135 [B66CFBB0] -> SMP dump:
<br>                                base_ver................0x1<br>                                mgmt_class..............0x81<br>                                class_ver...............0x1<br>                                method..................0x1
(SubnGet)<br>                                D
bit...................0x0<br>                                status..................0x0<br>                                hop_ptr.................0x0<br>                                hop_count...............0x2<br>                                trans_id................0x1274
<br>                                attr_id.................0x15
(PortInfo)<br>                                resv....................0x0<br>                                attr_mod................0x1<br>                                m_key...................0x0000000000000000<br>                                dr_slid.................0xFFFF
<br>                                dr_dlid.................0xFFFF<br><br><br>Sep 26 15:26:03 424407 [B6ED0BB0] -> __osm_nd_rcv_process_nd: Node 0x30d300002c7234<br>                                Description
= Agilent E2954A 4x Generator for InfiniBand.<br>Sep 26 15:26:03 424426 [B6ED0BB0] -> __osm_nd_rcv_process_nd: ]<br><br>Sep 26 15:26:03 679882 [B56CDBB0] -> SMP dump:<br>                                base_ver................0x1
<br>                                mgmt_class..............0x81<br>                                class_ver...............0x1<br>                                method..................0x81
(SubnGetResp)<br>                                D
bit...................0x1<br>                                status..................0x0<br>                                hop_ptr.................0x0<br>                                hop_count...............0x2<br>                                trans_id................0x1274
<br>                                attr_id.................0x15
(PortInfo)<br>                                resv....................0x0<br>                                attr_mod................0x1<br>                                m_key...................0x0000000000000000<br>                                dr_slid.................0xFFFF
<br>                                dr_dlid.................0xFFFF<br><br>                                Initial
path: [0][1][12]<br>                                Return
path:  [0][E][0]<br><br><br>Sep 26 15:26:03 680291 [B76D1BB0] -> osm_pi_rcv_process: [<br>Sep 26 15:26:03 680323 [B56CDBB0] -> __osm_sm_mad_ctrl_rcv_callback: ]<br>Sep 26 15:26:03 680343 [B76D1BB0] -> PortInfo dump:
<br>                                port
number.............0x1<br>                                node_guid...............0x0030d300002c7234<br>                                port_guid...............0x0030d300002c7234<br>                                m_key...................0x0000000000000000
<br>                                subnet_prefix...........0xfe80000000000000<br>                                base_lid................0xFFFF<br><br>Yes, it appears the Agilent exerciser returned good status to a SM Get
<br>PortInfo with a base_lid of 0xffff. The base_lid should be validated by<br>OpenSM.<br><br>-- Hal<br><br></blockquote></div><br>
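<br>
As a sketch of the validation Hal suggests: the InfiniBand architecture reserves 0x0001 through 0xBFFF for unicast LIDs, and 0xFFFF is the permissive LID, so a reported base_lid could be checked against that range before it is used. The names below are hypothetical and not taken from OpenSM, and the value is assumed to already be in host byte order:<br>

```c
#include <stdbool.h>
#include <stdint.h>

/* Unicast LID range per the InfiniBand architecture.
 * Macro and function names are illustrative only, not OpenSM identifiers. */
#define LID_UCAST_START 0x0001
#define LID_UCAST_END   0xBFFF

/* Returns true only when lid_ho (host byte order) is a usable unicast LID.
 * Rejects 0 (unassigned), the multicast range, and the permissive LID 0xFFFF. */
static bool lid_is_valid_unicast(uint16_t lid_ho)
{
    return lid_ho >= LID_UCAST_START && lid_ho <= LID_UCAST_END;
}
```

A check like this would reject the 0xFFFF base_lid reported in the PortInfo dump above instead of passing it on to the LID manager.<br>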