[openib-general] opensm and faulty hardware

Viswanath Krishnamurthy viswa.krish at gmail.com
Tue Sep 27 13:00:38 PDT 2005


Hal,

I have added a hack for now to work around the problem. A proper fix is still
needed later.

[root at ibstg1 opensm]# svn diff osm_port.h
Index: osm_port.h
===================================================================
--- osm_port.h (revision 3549)
+++ osm_port.h (working copy)
@@ -1049,6 +1049,8 @@
{
CL_ASSERT( p_physp );
CL_ASSERT( osm_physp_is_valid( p_physp ) );
+ if (p_physp->port_info.base_lid == 0xFFFF)
+ return (0);
return( p_physp->port_info.base_lid );
}
/*
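
For the later proper fix, the loops in __osm_lid_mgr_init_sweep could use a
counter wider than 16 bits, so that a range ending at 0xFFFF terminates.
Below is a minimal standalone sketch of that idea, not a patch against
OpenSM. The names next_lid_u16, mark_lid_range and the plain pointer table
are hypothetical stand-ins for the real cl_ptr_vector code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shows the wrap-around: a 16-bit LID incremented past 0xFFFF becomes 0. */
static uint16_t next_lid_u16(uint16_t lid)
{
    return (uint16_t)(lid + 1);
}

/*
 * Hypothetical helper (not an OpenSM function): record `owner` for every
 * LID in [min_lid, max_lid]. `table` must have room for 0x10000 entries.
 * The counter is 32 bits wide, so `lid <= max_lid` becomes false after
 * 0xFFFF instead of wrapping back to 0 and looping forever.
 * Returns the number of entries written.
 */
static unsigned mark_lid_range(const void **table, uint16_t min_lid,
                               uint16_t max_lid, const void *owner)
{
    unsigned count = 0;
    uint32_t lid;

    assert(table != NULL);
    for (lid = min_lid; lid <= max_lid; lid++) {
        table[lid] = owner;
        count++;
    }
    return count;
}
```

With min_lid and max_lid both 0xFFFF this writes exactly one entry and
returns, where a 16-bit counter would never leave the loop.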


On 27 Sep 2005 15:11:05 -0400, Hal Rosenstock <halr at voltaire.com> wrote:
>
> On Tue, 2005-09-27 at 14:13, Viswanath Krishnamurthy wrote:
> > I tracked down the issue to a bug in osm_lid_mgr.c
> >
> > function: __osm_lid_mgr_init_sweep(...)
> >
> > The bad hardware was returning an assigned LID of 0xFFFF. In this
> > function there is a loop, shown here with line numbers, where opensm
> > gets stuck:
> >
> > 392 p_port_guid_tbl = &p_mgr->p_subn->port_guid_tbl;
> > 393
> > 394 for( p_port = (osm_port_t*)cl_qmap_head( p_port_guid_tbl );
> > 395      p_port != (osm_port_t*)cl_qmap_end( p_port_guid_tbl );
> > 396      p_port = (osm_port_t*)cl_qmap_next( &p_port->map_item ) )
> > 397 {
> > 398   osm_port_get_lid_range_ho(p_port, &disc_min_lid, &disc_max_lid);
> > 399   for (lid = disc_min_lid; lid <= disc_max_lid; lid++) <===== Bug here
> > 400     cl_ptr_vector_set(p_discovered_vec, lid, p_port );
> > 401 }
> >
> > Since disc_min_lid and disc_max_lid are both 0xFFFF, and these are
> > unsigned 16-bit numbers, incrementing lid past 0xFFFF wraps it back
> > to 0. The condition in the for loop therefore never becomes false,
> > and opensm is stuck in the loop. There are a couple of other places
> > in that function that need fixing too.
>
> Sep 26 15:26:03 424135 [B66CFBB0] -> SMP dump:
> base_ver................0x1
> mgmt_class..............0x81
> class_ver...............0x1
> method..................0x1 (SubnGet)
> D bit...................0x0
> status..................0x0
> hop_ptr.................0x0
> hop_count...............0x2
> trans_id................0x1274
> attr_id.................0x15 (PortInfo)
> resv....................0x0
> attr_mod................0x1
> m_key...................0x0000000000000000
> dr_slid.................0xFFFF
> dr_dlid.................0xFFFF
>
>
> Sep 26 15:26:03 424407 [B6ED0BB0] -> __osm_nd_rcv_process_nd: Node
> 0x30d300002c7234
> Description = Agilent E2954A 4x Generator for InfiniBand.
> Sep 26 15:26:03 424426 [B6ED0BB0] -> __osm_nd_rcv_process_nd: ]
>
> Sep 26 15:26:03 679882 [B56CDBB0] -> SMP dump:
> base_ver................0x1
> mgmt_class..............0x81
> class_ver...............0x1
> method..................0x81 (SubnGetResp)
> D bit...................0x1
> status..................0x0
> hop_ptr.................0x0
> hop_count...............0x2
> trans_id................0x1274
> attr_id.................0x15 (PortInfo)
> resv....................0x0
> attr_mod................0x1
> m_key...................0x0000000000000000
> dr_slid.................0xFFFF
> dr_dlid.................0xFFFF
>
> Initial path: [0][1][12]
> Return path: [0][E][0]
>
>
> Sep 26 15:26:03 680291 [B76D1BB0] -> osm_pi_rcv_process: [
> Sep 26 15:26:03 680323 [B56CDBB0] -> __osm_sm_mad_ctrl_rcv_callback: ]
> Sep 26 15:26:03 680343 [B76D1BB0] -> PortInfo dump:
> port number.............0x1
> node_guid...............0x0030d300002c7234
> port_guid...............0x0030d300002c7234
> m_key...................0x0000000000000000
> subnet_prefix...........0xfe80000000000000
> base_lid................0xFFFF
>
> Yes, it appears the Agilent exerciser returned good status to a SM Get
> PortInfo with a base_lid of 0xffff. The base_lid should be validated by
> OpenSM.
>
> -- Hal
>
>
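
On validating base_lid: a rough sketch of the kind of check that could
replace the hack above. It assumes the LID has already been converted to
host byte order. The function names are hypothetical, not existing OpenSM
calls. The range constants follow the InfiniBand LID layout as I understand
it: 0x0000 reserved, 0x0001 to 0xBFFF unicast, 0xC000 to 0xFFFE multicast,
0xFFFF the permissive LID.

```c
#include <stdbool.h>
#include <stdint.h>

/* Unicast LID bounds, host byte order. */
#define UCAST_LID_MIN 0x0001u
#define UCAST_LID_MAX 0xBFFFu

/* Hypothetical check: true only for a LID that a port can legally own. */
static bool base_lid_is_unicast(uint16_t base_lid_ho)
{
    return base_lid_ho >= UCAST_LID_MIN && base_lid_ho <= UCAST_LID_MAX;
}

/*
 * Hypothetical wrapper: report 0 ("no LID assigned") for anything outside
 * the unicast range, which covers the 0xFFFF returned by the exerciser.
 */
static uint16_t sanitize_base_lid(uint16_t base_lid_ho)
{
    return base_lid_is_unicast(base_lid_ho) ? base_lid_ho : 0;
}
```

Unlike the 0xFFFF-only test in the hack, this also rejects multicast values
reported as a base LID. Whether 0 is the right fallback for the LID manager
is an open question.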