[openib-general] SM Bad Port Handling

Thu Apr 7 12:32:24 PDT 2005

Hi,

Below is a writeup on bad port handling by the SM. I would appreciate
any comments on this before I move on to the implementation.

Thanks.

-- Hal

Problem Statement:

Currently, OpenSM issues (directed route) SubnGet for NodeInfo and
NodeDescription to any node it finds. It then requests PortInfo for 
each port which is physically up.

There are scenarios where the port is physically up, but there is no
response to the SM get requests. In this case, OpenSM keeps retrying,
never gives up, and doesn't service anything else in the subnet (I'm
not 100% positive on this last point).

Assumption:

The proposed solution assumes that the ignore GUIDs file option of
OpenSM only impacts the routing algorithm (path counting) and should not
be extended for bad port handling.

Proposed Solution:

OpenSM will implement a configurable policy (some number of
consecutive SM requests to a port that go unanswered). Once the
timeout/retry strategy is exhausted, OpenSM will mark that port as
"bad".

At this point, should it attempt to revive the port by bringing the
physical link down and back up? Should it try this several times before
declaring the port as "bad"? In any case, this is a refinement on the
basic strategy for dealing with this scenario.
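
To make the basic timeout/retry strategy concrete, here is a minimal
sketch of a per-port counter of consecutive unanswered requests. It is
written in Python for brevity and is not OpenSM code; the class name,
its fields, and the default threshold of 3 are all assumptions made for
illustration.

```python
from dataclasses import dataclass


@dataclass
class PortRetryPolicy:
    """Hypothetical policy: a port is "bad" after N consecutive timeouts."""

    max_consecutive_failures: int = 3  # assumed default; meant to be configurable
    failures: int = 0
    bad: bool = False

    def record_response(self) -> None:
        # Any response resets the count of consecutive failures.
        self.failures = 0

    def record_timeout(self) -> bool:
        # Count one unanswered request; return True once the port is "bad".
        self.failures += 1
        if self.failures >= self.max_consecutive_failures:
            self.bad = True
        return self.bad
```

A link bounce before giving up would slot in just ahead of the point
where `bad` is set, but that part is left out of the sketch.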

There could also be a periodic "ping" at a slower rate to check
whether the "bad" ports revive.
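
One possible shape for the slower "ping", sketched in Python rather
than OpenSM's C: healthy ports are queried on every sweep and "bad"
ports only on every Nth sweep. The function name, the dictionary layout,
and the `bad_ping_every` setting are hypothetical.

```python
def ports_to_poll(sweep_count, ports, bad_ping_every=10):
    """Return the port names to query on this sweep.

    ports maps a port name to True if the port is marked "bad".
    bad_ping_every is an assumed setting: "bad" ports are pinged only
    on every Nth sweep, healthy ports on every sweep.
    """
    ping_bad = sweep_count % bad_ping_every == 0
    return [name for name, is_bad in ports.items() if not is_bad or ping_bad]
```

For example, with `bad_ping_every=10` a "bad" port is skipped on sweeps
1 through 9 and queried again on sweep 10.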

A "bad" port per this scenario still maintains its LID and other state.
OpenSM will indicate that a "bad" port was detected via an internal port
physical state, which it will set to down. The "real" port physical
state will still be reflected accurately inside OpenSM.
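
The two-state bookkeeping could look roughly like the sketch below
(Python, for illustration only). The record type and field names are
made up, and the physical state is simplified to just up/down, which is
coarser than the real PortInfo encoding.

```python
from dataclasses import dataclass
from enum import Enum


class PhysState(Enum):
    # Simplified to two values for this sketch.
    DOWN = "down"
    UP = "up"


@dataclass
class PortRecord:
    lid: int
    real_phys_state: PhysState  # the "real" state, kept accurate
    sm_phys_state: PhysState    # OpenSM's internal state

    def mark_bad(self) -> None:
        # Only the internal state changes; LID and real state are kept.
        self.sm_phys_state = PhysState.DOWN

    @property
    def is_bad(self) -> bool:
        # "Bad" here means: really up, but internally treated as down.
        return (self.real_phys_state is PhysState.UP
                and self.sm_phys_state is PhysState.DOWN)
```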

Once a "bad" port is detected, it will no longer be polled, and the
routing algorithm should be invoked to route around it.
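
To illustrate what "routing around" a bad port means, the sketch below
computes breadth-first hop counts while treating any link that touches
a "bad" port as down. This is not OpenSM's routing code; the function
name and the tuple-based link and port representation are assumptions.

```python
from collections import deque


def hop_counts(links, bad_ports, source):
    """Breadth-first hop counts from source, skipping links on bad ports.

    links is a list of ((node_a, port_a), (node_b, port_b)) tuples and
    bad_ports is a set of (node, port) tuples; both shapes are assumed.
    """
    adjacency = {}
    for end_a, end_b in links:
        if end_a in bad_ports or end_b in bad_ports:
            continue  # treat the link as down for routing purposes
        adjacency.setdefault(end_a[0], []).append(end_b[0])
        adjacency.setdefault(end_b[0], []).append(end_a[0])

    hops = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops
```

With a triangle A-B, B-C, A-C, marking C's port on the A-C link as
"bad" leaves C reachable from A only via B.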

Is there a need to store these "bad" ports persistently (and ignore them
on startup)?