[openib-general] SM Bad Port Handling

Eitan Zahavi eitan at mellanox.co.il
Thu Apr 7 13:02:39 PDT 2005


Hi Hal,

Please see my comments below.

Eitan Zahavi

> Problem Statement:
> 
> Currently, OpenSM issues (directed route) SubnGet for NodeInfo and
> NodeDescription to any node it finds. It then requests PortInfo for
> each port which is physically up.
> 
> There are scenarios where the port is physically up, but there is no
> response to the SM get requests. In this case, the OpenSM keeps
> retrying, never gives up, and doesn't service anything else in the
> subnet (I'm not 100% positive on this last point).
[EZ] I have never seen this!  Are you sure about it? Are you sure we are
talking about gen1 ported to gen2?

When a port does not respond, OpenSM retries the send (the lower level
actually does this) for the number of retries OpenSM is configured to use
(currently 4) and then ignores the port and everything behind it. The
reported topology (on stdout) will show the word UNKNOWN on the remote side
of the link this port connects to.
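
To make the retry-then-ignore behavior concrete, here is a minimal Python
sketch. It is only an illustration and not OpenSM code: the names
query_port, describe_link and MAX_RETRIES are hypothetical, and it assumes
the count of 4 means four attempts in total.

```python
from typing import Callable, Optional

# Assumption: the "4 times" above is the total number of attempts.
MAX_RETRIES = 4


def query_port(send: Callable[[], Optional[dict]],
               retries: int = MAX_RETRIES) -> Optional[dict]:
    """Call send() up to `retries` times; return the first reply, else None."""
    for _ in range(retries):
        reply = send()
        if reply is not None:
            return reply
    # Retries exhausted: the caller ignores this port and what is behind it.
    return None


def describe_link(local: str, reply: Optional[dict]) -> str:
    """Format one topology line; an unanswered remote side is UNKNOWN."""
    remote = reply["node"] if reply is not None else "UNKNOWN"
    return f"{local} -> {remote}"
```

The point of the sketch is that the loop is bounded, so discovery moves on
once the retries are used up instead of blocking on the silent port.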

I will be happy to see a log file that shows what you claim happens, or even
an explanation of how and where the code causes that.

I have been checking the way OpenSM handles unresponsive ports during the
last two weeks, and did not see such a case.
> 
> Assumption:
> 
> The proposed solution assumes that the ignore GUIDs file option of
> OpenSM only impacts the routing algorithm (path counting) and should not
> be extended for bad port handling.
[EZ] This is correct.
> 
> Proposed Solution:
> 
> The OpenSM will implement a configurable policy (some number of
> consecutive lack of responses to SM requests). At the point of
> exhaustion of the timeout/retry strategy, that port will be marked as
> "bad" by OpenSM.
[EZ] This is already the current behavior. Nothing should be done.
> 
> At this point, should it attempt to revive the port by bringing the
> physical link down and back up ? Should it try this several times before
> declaring the port as "bad" ? In any case, this is a refinement on the
> basic strategy for dealing with this scenario.
> 
> Also, there could also be a periodic "ping" at a slower rate to check if
> the "bad" ports revive.
[EZ] This will be released in gen1 within 2 weeks or so. The enhancement to
light sweep will include the unresponsive ports in the light sweep. Once
they respond, a new heavy sweep will be triggered.
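
A rough Python sketch of that light sweep extension follows. It is an
assumption-laden illustration, not the gen1 implementation: light_sweep,
the probe callback and the set of port names are all hypothetical.

```python
from typing import Callable, Set


def light_sweep(unresponsive: Set[str],
                probe: Callable[[str], bool]) -> bool:
    """Re-probe ports previously marked unresponsive.

    Ports that answer are removed from `unresponsive` in place.
    Returns True when at least one port revived, meaning a heavy
    sweep should be scheduled.
    """
    revived = {port for port in unresponsive if probe(port)}
    unresponsive -= revived
    return bool(revived)
```

If no port answers, the function returns False and the set is left as it
was, so the next light sweep probes the same ports again.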
> 
> A "bad" port per this scenario still maintains its LID and other state.
> OpenSM will indicate a "bad" port detected via an internal port physical
> state which it will set to down. The "real" port physical state will be
> reflected accurately inside OpenSM.
[EZ] It is better to use the "un-healthy" bit of the physical port - which
OpenSM is already maintaining.
> 
> Once a "bad" port is detected, it will no longer be polled and the
> routing algorithm should be invoked to route around this.
> 
> Is there a need to store these "bad" ports persistently (and ignore them
> on startup) ?
[EZ] No, I do not think so.