[openib-general] Re: Another opensm problem ?

Hal Rosenstock halr at voltaire.com
Fri Sep 23 15:54:25 PDT 2005


On Fri, 2005-09-23 at 14:57, Hal Rosenstock wrote:
> On Fri, 2005-09-23 at 13:50, Viswanath Krishnamurthy wrote:
> > - After 7-8 iterations, I ran into a weird problem, where opensm was
> > showing the HCA as UNKNOWN. The port
> > never came up to ACTIVE state.  I unplugged and replugged into
> > different slots, the port remained in INIT
> > state.
> 
> Mellanox    : SW : 12 : INI :      :     : 2048 : 1x  : 2.5 : 0002c9010d26e780 : UNKNOWN
> 
> OpenSM thinks that either there is no physical port on the other end of
> the link or it is not "valid" (GUID non 0). Obviously it is there as the
> port state is INIT so the physical link came up which requires the
> remote end to be there.

From the log you sent, this is exactly what is happening:
Sep 23 10:07:23 451191 [B7751BB0] -> osm_drop_mgr_process: Checking port 0x0002c9010d26e780.
Sep 23 10:07:23 451209 [B7751BB0] -> osm_drop_mgr_process: Checking port 0x0002c90200400cfd.
Sep 23 10:07:23 451226 [B7751BB0] -> osm_drop_mgr_process: ERR 0108: Unknown remote side for node 0x0002c9010d26e780 port 20. Adding to light sweep sampling list.
Sep 23 10:07:23 451251 [B7751BB0] -> Directed Path Dump of 1 hop path:
                                Path = [0][1]
Sep 23 10:07:23 451267 [B7751BB0] -> osm_drop_mgr_process: ]
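
To check whether the same port keeps being reported across sweeps, the log can be scanned for ERR 0108 lines and tallied per node GUID and port. The following is only a rough sketch: the function name unknown_remote_ports is made up for illustration, and the pattern is inferred from the single ERR 0108 message quoted above, so other opensm versions may word the message differently.

```python
import re
from collections import Counter

# Pattern inferred from the one ERR 0108 line in the excerpt above;
# this is an assumption about the message format, not an opensm guarantee.
ERR_0108 = re.compile(
    r"osm_drop_mgr_process: ERR 0108: Unknown remote side for node "
    r"(0x[0-9a-fA-F]{16}) port (\d+)"
)


def unknown_remote_ports(lines):
    """Count ERR 0108 reports per (node GUID, port number).

    Hypothetical helper: `lines` is any iterable of log lines.
    """
    counts = Counter()
    for line in lines:
        match = ERR_0108.search(line)
        if match:
            key = (match.group(1).lower(), int(match.group(2)))
            counts[key] += 1
    return counts


# Lines copied from the excerpt above.
sample = [
    "Sep 23 10:07:23 451191 [B7751BB0] -> osm_drop_mgr_process: "
    "Checking port 0x0002c9010d26e780.",
    "Sep 23 10:07:23 451226 [B7751BB0] -> osm_drop_mgr_process: "
    "ERR 0108: Unknown remote side for node 0x0002c9010d26e780 port 20. "
    "Adding to light sweep sampling list.",
]

for (guid, port), n in unknown_remote_ports(sample).items():
    print(f"{guid} port {port}: {n} report(s)")
```

Run against the full log (for example, by passing an open file object), a count that grows with every sweep for the same GUID and port would confirm that the port never leaves the light sweep sampling list.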

So look at osm_drop_mgr.c line 707.
Can you enhance the log output there to show which check is failing:
osm_physp_is_valid(p_physp) or osm_physp_get_remote(p_physp)?

Also, opensm appears to keep light sweeping this port, but whichever
switch port it is attached to, the remote side does not respond. I am
not sure where the problem is. It could be on the outgoing side of the
switch (we could run diags against the switch and various ports; I would
be curious what they return when the subnet is in this broken state) or
on the HCA. However, the fact that restarting opensm made the problem go
away without touching anything else suggests otherwise, possibly that
the problem is in opensm itself.

> One other note is that it appears to have come up as 1x. Is that what
> should happen ? 

-- Hal



