[openib-general] SM Bad Port Handling

Wed Apr 13 02:00:05 PDT 2005

On Wed, 2005-04-13 at 01:28, Eitan Zahavi wrote:
> [EZ] This is true. Currently there is only one cause for the
> un-healthy bits to be set - which are exactly as you point - these
> traps. The point I was trying to make was that this bit is the
> mechanism for flagging a port status is bad. 
> 
> What I did recommend was to write a "statistical" analysis of Directed
> Route packet drop - such that we can find the ports with a high drop
> rate and mark them as un-healthy. If you mark every port that does not
> respond to a MAD as un-healthy you can suffer from flaky links
> somewhere on the route to that port. Only analysis of the number of
> good packets vs. dropped packets can lead you to the right bad port.

The original proposal on this said the following:

"The OpenSM will implement a configurable policy (some number of
consecutive lack of responses to SM requests). At the point of
exhaustion of the timeout/retry strategy, that port will be marked as
"bad" by OpenSM."
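
To make the policy concrete, here is a minimal sketch of a
consecutive-miss counter. It is not OpenSM code: the PortHealth type,
the record() method, and the default threshold of 3 are all assumptions
made for illustration, and the threshold in particular is the open
question raised just below.

```python
from dataclasses import dataclass


@dataclass
class PortHealth:
    """Consecutive unanswered SM requests for one port (hypothetical type)."""

    threshold: int = 3          # assumed default; the thread leaves this open
    consecutive_misses: int = 0
    bad: bool = False

    def record(self, responded: bool) -> bool:
        """Record one request outcome (after its own timeout/retry cycle
        has finished) and return the current "bad" flag."""
        if responded:
            # Any response breaks the run of consecutive misses.
            self.consecutive_misses = 0
        else:
            self.consecutive_misses += 1
            if self.consecutive_misses >= self.threshold:
                self.bad = True
        return self.bad
```

In this sketch a response resets the counter but does not clear the
"bad" flag; reviving a port is left to a separate mechanism.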

Any idea what might make a good default threshold (for consecutive
retries)? Or do you think no default would be sufficient?

If a link is flaky and MADs can't get through, should it be used for
non-MAD traffic?

Note that the proposal also said:

"Also, there could also be a periodic "ping" at a slower rate to check
if the "bad" ports revive."
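
One way to picture the slower "ping" is as a per-sweep selection of
which ports to query. The sketch below assumes the SM sweeps
periodically and numbers its sweeps; ports_to_probe() and the
bad_ping_every value of 10 are hypothetical and not taken from OpenSM.

```python
def ports_to_probe(sweep, healthy, bad, bad_ping_every=10):
    """Return the set of ports to query on this sweep.

    Healthy ports are queried on every sweep; ports marked "bad" are
    only re-checked on every Nth sweep (bad_ping_every is an assumed
    value, not an OpenSM default).
    """
    due = set(healthy)
    if sweep % bad_ping_every == 0:
        due |= set(bad)
    return due


def revive(port, healthy, bad):
    """Move a port that answered its ping back to the healthy set."""
    bad.discard(port)
    healthy.add(port)
```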

For analysis of good vs. errored and dropped packets (along the path to
that node), there are OpenIB diagnostic tools that can help.
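
As a rough illustration of the good vs. dropped analysis Eitan
describes, the sketch below charges every probe to each link on its
route and compares per-link drop ratios. It is a simple heuristic under
stated assumptions, not what any OpenIB tool does: drop_ratios() and
the (path, responded) input format are made up for this example.

```python
from collections import Counter


def drop_ratios(probes):
    """Return {link: dropped / sent} for every link seen.

    probes is an iterable of (path, responded) pairs, where path is the
    sequence of link ids a directed-route packet crossed (a hypothetical
    format). A dropped packet is charged to every link on its path, so
    a link shared with many good routes ends up with a low ratio.
    """
    sent = Counter()
    dropped = Counter()
    for path, responded in probes:
        for link in path:
            sent[link] += 1
            if not responded:
                dropped[link] += 1
    return {link: dropped[link] / sent[link] for link in sent}


# Link "A" is shared by both routes; only the route through "C" drops.
probes = [(("A", "B"), True)] * 3 + [(("A", "C"), False)] * 3
ratios = drop_ratios(probes)
worst = max(ratios, key=ratios.get)
```

Here "C" gets the highest ratio, while the shared link "A" is partly
cleared by the good packets that crossed it, which is the point of
counting good packets as well as drops.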

-- Hal