[openib-general] SM Bad Port Handling

Thu Apr 7 13:11:02 PDT 2005

On Thu, 2005-04-07 at 16:02, Eitan Zahavi wrote:
> Hi Hal,
> 
> Please see my comments below.
> 
> Eitan Zahavi
> 
> > Problem Statement:
> > 
> > Currently, OpenSM issues (directed route) SubnGet for NodeInfo and
> > NodeDescription to any node it finds. It then requests PortInfo for
> > each port which is physically up.
> > 
> > There are scenarios where the port is physically up, but there is no
> > response to the SM get requests. In this case, the OpenSM keeps
> > retrying, never gives up, and doesn't service anything else in the
> > subnet (I'm not 100% positive on this last point).
> [EZ] I have never seen this!  Are you sure about it? Are you sure we
> are talking about gen1 ported to gen2?
> 
> What will happen in the case of a non-responding port is that OpenSM will
> retry the send (actually the lower level does it) for the number of
> retries OpenSM is configured to use (actually 4 times) and then ignore
> the port and everything behind it. The reported topology (on stdout)
> will have the word UNKNOWN on the remote side of the link this port
> connects to.
> 
> I will be happy to see a log file that shows what you claim happens.
> Or, if you can, explain to me how and where the code causes that.

This was reported by Ron a while ago on this list. He sent log extracts
of what was going on. It was around when I asked about the Anafa
firmware issue with LFTTop.
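
For reference, the behavior Eitan describes (retry a fixed number of
times, then ignore the port and report UNKNOWN on the far side of the
link) could be sketched roughly as below. This is only an illustration
and not OpenSM code: PortRecord, query_port and the send callback are
hypothetical names, and it assumes "4 times" means four send attempts in
total.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Retry count quoted in the thread; the constant name is hypothetical.
MAX_RETRIES = 4


@dataclass
class PortRecord:
    """Hypothetical bookkeeping entry for one discovered port."""
    guid: int
    port_num: int
    healthy: bool = True
    remote: str = "UNKNOWN"  # what the topology dump shows for the far side


def query_port(send: Callable[[int, int], Optional[str]],
               port: PortRecord,
               retries: int = MAX_RETRIES) -> bool:
    """Try to reach a port; give up after `retries` unanswered attempts.

    `send` stands in for the lower-level MAD send and returns None on timeout.
    """
    for _ in range(retries):
        reply = send(port.guid, port.port_num)
        if reply is not None:
            port.remote = reply
            port.healthy = True
            return True
    # Retries exhausted: stop polling and leave the remote side unknown.
    port.healthy = False
    port.remote = "UNKNOWN"
    return False
```

With a send callback that always times out, query_port returns False
after four attempts and the port stays marked UNKNOWN, so discovery can
move on to the rest of the subnet.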

> I have been checking the way OpenSM handles unresponsive ports during
> the last two weeks, and did not see such a case.

Is this in both Gold 1.6.1 (OpenSM 1.7/1.7.1 ?) and Gold 1.7 (OpenSM
1.8) ? 

> > Assumption:
> > 
> > The proposed solution assumes that the ignore GUIDs file option of
> > OpenSM only impacts the routing algorithm (path counting) and should
> > not be extended for bad port handling.
> [EZ] This is correct.
> > 
> > Proposed Solution:
> > 
> > The OpenSM will implement a configurable policy (some number of
> > consecutive lack of responses to SM requests). At the point of
> > exhaustion of the timeout/retry strategy, that port will be marked
> > as "bad" by OpenSM.
> [EZ] This is already the current behavior. Nothing should be done.
> > 
> > At this point, should it attempt to revive the port by bringing the
> > physical link down and back up? Should it try this several times
> > before declaring the port as "bad"? In any case, this is a refinement
> > on the basic strategy for dealing with this scenario.
> > 
> > Also, there could be a periodic "ping" at a slower rate to check
> > if the "bad" ports revive.
> [EZ] This will be released in gen1 within 2 weeks or so.

What OpenSM release will this be ?

> The enhancement to light sweep will include the unresponsive ports in
> the light sweep. Once they respond, a new heavy sweep will be
> generated.
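
To make the light sweep enhancement concrete, here is a minimal sketch
of the idea as described: ports previously marked unresponsive are
pinged again on each light sweep, and any answer triggers a heavy
sweep. The function names, the ping callback and the heavy_sweep
callback are all hypothetical and do not correspond to OpenSM
internals.

```python
from typing import Callable, List, Set


def light_sweep(unhealthy: Set[str],
                ping: Callable[[str], bool]) -> List[str]:
    """Return the previously unresponsive ports that answered this time."""
    return [port for port in sorted(unhealthy) if ping(port)]


def sweep_once(unhealthy: Set[str],
               ping: Callable[[str], bool],
               heavy_sweep: Callable[[], None]) -> bool:
    """Run one light sweep; request a heavy sweep if any port revived."""
    revived = light_sweep(unhealthy, ping)
    if not revived:
        return False
    # Revived ports leave the unhealthy set before rediscovery starts.
    unhealthy.difference_update(revived)
    heavy_sweep()
    return True
```

Because the revisit rides on the existing light sweep, no separate
slower-rate "ping" timer is needed in this sketch.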
> 
> > 
> > A "bad" port per this scenario still maintains its LID and other
> > state. OpenSM will indicate a "bad" port detected via an internal
> > port physical state which it will set to down. The "real" port
> > physical state will be reflected accurately inside OpenSM.
> [EZ] It is better to use the "un-healthy" bit of the physical port -
> which OpenSM is already maintaining.
> > 
> > Once a "bad" port is detected, it will no longer be polled and the
> > routing algorithm should be invoked to route around this.
> > 
> > Is there a need to store these "bad" ports persistently (and ignore
> > them on startup)?
> [EZ] No, I do not think so.

Thanks.

-- Hal