[openib-general] SM Bad Port Handling

Wed Apr 13 07:03:21 PDT 2005

Eitan, 

	Your analysis is not completely accurate. The SM configures the
subnet using directed-route MADs only, and it builds a spanning tree of
directed routes. What I want to say is that it doesn't matter why exactly
a port is unreachable. Once a port cannot be reached, you can retry the
entire heavy sweep process, but if the problem repeats itself (X times)
on the same port, you have no alternative other than to disable it. If
the SM had an alternative method of building directed routes, then such
an alternative path could be attempted. Currently that is not relevant.
Speaking of "statistical analysis", what are the odds that a port will
behave well when it is queried directly, but start to lose packets when
a directed route is routed through it, and behave consistently during
all retries? Again, even if this is the case (and, to put it mildly, I
am not sure how frequent it is), the port behind it is unreachable and
therefore "bad".
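
To make the "X times" policy concrete, here is a minimal sketch of a
consecutive-failure counter. It is not OpenSM code: the class name, the
port identifiers, and the default threshold of 3 are all assumptions
chosen for illustration.

```python
from collections import defaultdict


class BadPortTracker:
    """Counts consecutive heavy sweeps in which a port was unreachable.

    Hypothetical helper for illustration only; not part of OpenSM.
    """

    def __init__(self, max_failures=3):
        # max_failures plays the role of "X" in the text; 3 is an arbitrary value.
        self.max_failures = max_failures
        self.failures = defaultdict(int)
        self.bad_ports = set()

    def record_sweep(self, port, reachable):
        """Record one sweep result for a port; return True if it is marked bad."""
        if reachable:
            # Any successful sweep resets the count and clears the mark.
            self.failures[port] = 0
            self.bad_ports.discard(port)
        else:
            self.failures[port] += 1
            if self.failures[port] >= self.max_failures:
                self.bad_ports.add(port)
        return port in self.bad_ports
```

Only failures on consecutive sweeps count here, so a single good sweep
starts the count over.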

The current unhealthy port mechanism is not redundant with this "bad"
port mechanism because it does not handle the same case. Both mechanisms
are required. Whether they can share the same status bit is really an
implementation issue.

Relying on traps is very problematic in some cases, particularly in
the initial bring-up sweep when the SM LID is not even configured
(remember VTEC?).

Shahar   

________________________________________
From: openib-general-bounces at openib.org
[mailto:openib-general-bounces at openib.org] On Behalf Of Eitan Zahavi
Sent: Wednesday, April 13, 2005 11:21 AM
To: Hal Rosenstock; Eitan Zahavi
Cc: openib-general at openib.org
Subject: RE: [openib-general] SM Bad Port Handling

I probably did not make my point very clear: 
It is bad (not to say wrong) to disqualify a port and mark it as a bad
port just because it did not respond to queries. 
The cause of the issue might be a flaky link on the directed route to
the port. 
If the SM were able to find the port with that flaky link, it would avoid
marking the wrong ports. Moreover, the port that was almost marked as
bad by the simplistic algorithm you propose would be discovered and
operational, as there are many other paths to reach it - walking around
the real bad port!
Eitan Zahavi 
Design Technology Director 
Mellanox Technologies LTD 
Tel:+972-4-9097208 
Fax:+972-4-9593245 
P.O. Box 586 Yokneam 20692 ISRAEL 
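
The "statistical analysis" argued for above can be sketched as a per-link
drop-rate count over directed-route probes. This is only an illustration
under stated assumptions: the function names, the link labels, and the
representation of a probe as a (route, answered) pair are all
hypothetical, and nothing here reflects how OpenSM stores routes.

```python
from collections import defaultdict


def link_drop_rates(probes):
    """Return {link: fraction of probes through that link that got no response}.

    probes is a list of (route, answered) pairs, where route is the list of
    links a directed-route MAD traversed. All names are hypothetical.
    """
    sent = defaultdict(int)
    dropped = defaultdict(int)
    for route, answered in probes:
        for link in route:
            sent[link] += 1
            if not answered:
                dropped[link] += 1
    return {link: dropped[link] / sent[link] for link in sent}


def most_suspect_link(probes):
    """Return the link with the highest drop rate, or None if there are no probes."""
    rates = link_drop_rates(probes)
    if not rates:
        return None
    return max(rates, key=rates.get)


# Two routes share the flaky link "l2"; routes that avoid it succeed.
probes = [
    (["l1", "l2"], False),
    (["l1", "l3"], True),
    (["l4", "l2"], False),
    (["l4", "l3"], True),
]
suspect = most_suspect_link(probes)
```

In this example every probe through "l2" is dropped while "l1" and "l4"
each lose only half, so "l2" is singled out and the ports behind it are
not blamed.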

> -----Original Message----- 
> From: Hal Rosenstock [mailto:halr at voltaire.com] 
> Sent: Wednesday, April 13, 2005 12:00 PM 
> To: Eitan Zahavi 
> Cc: openib-general at openib.org 
> Subject: RE: [openib-general] SM Bad Port Handling 
> 
> On Wed, 2005-04-13 at 01:28, Eitan Zahavi wrote: 
> > [EZ] This is true. Currently there is only one cause for the 
> > un-healthy bits to be set - which is exactly as you point out - these 
> > traps. The point I was trying to make was that this bit is the 
> > mechanism for flagging that a port's status is bad. 
> > 
> > What I did recommend was to write a "statistical" analysis of
> > Directed Route packet drop - such that we can find the ports with a
> > high drop rate and mark them as un-healthy. If you mark every port
> > that does not respond to a MAD as un-healthy you can suffer from
> > flaky links somewhere on the route to that port. Only analysis of
> > the number of good packets vs. dropped packets can lead you to the
> > right bad port.
> 
> The original proposal on this said the following: 
> 
> "The OpenSM will implement a configurable policy (some number of 
> consecutive lack of responses to SM requests). At the point of 
> exhaustion of the timeout/retry strategy, that port will be marked as 
> "bad" by OpenSM." 
> 
> Any idea on what might make a good default threshold (for consecutive 
> retries) ? Do you think there is no sufficient default ? 
> 
> If a link is flaky and MADs can't get through, should it be used for
> non MAD traffic ? 
> 
> Also note that the proposal also said: 
> 
> "Also, there could also be a periodic "ping" at a slower rate to check
> if the "bad" ports revive." 
> 
> In terms of analysis of good v. errored and dropped packets (along the
> path to that node), there are OpenIB diagnostic tools to help with
> this. 
> 
> -- Hal
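
The periodic "ping" of "bad" ports mentioned in the quoted proposal might
look roughly like the sketch below. The function names, the use of a
sweep counter, and the interval of 10 sweeps are assumptions made for
illustration; the proposal itself does not specify a rate.

```python
def ports_to_ping(bad_ports, sweep_number, ping_every=10):
    """Return the bad ports to re-probe on this sweep.

    Bad ports are probed only on every ping_every-th sweep, that is, at a
    slower rate than normal sweeps. ping_every=10 is an arbitrary value.
    """
    if sweep_number % ping_every != 0:
        return []
    return sorted(bad_ports)


def handle_ping_result(bad_ports, port, responded):
    """Clear the "bad" mark if the port answered; return True if it revived."""
    if responded:
        bad_ports.discard(port)
    return port not in bad_ports
```

A port that answers a ping is simply dropped from the bad set, so the
next regular sweep treats it like any other port.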