[openib-general] SM Bad Port Handling

Eitan Zahavi eitan at mellanox.co.il
Wed Apr 13 02:20:52 PDT 2005


I probably did not make my point very clear:

It is bad (not to say wrong) to disqualify a port and mark it as a bad port
just because it did not respond to queries.
The cause might be a flaky link somewhere on the directed route to the
port.
If the SM were able to find that flaky link port, it would avoid marking
the wrong ports. Moreover, the port that the simplistic algorithm you propose
would have marked as bad will be discovered and operational, as there are
many other paths to reach it, walking around the real bad port!
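
A minimal sketch of the kind of drop-rate analysis discussed here, not OpenSM
code. It assumes the SM keeps, per directed route, a count of MADs sent and
MADs dropped. The function name, the port ids, and both thresholds are
hypothetical. It charges each route's counts to every port on that route,
picks the port with the worst aggregate drop rate, discards the routes
through it, and repeats, so a port that is still reachable over clean
alternate paths is not blamed.

```python
from collections import defaultdict


def suspect_ports(route_stats, min_sent=20, max_drop_rate=0.2):
    """Return ports with an excessive aggregate drop rate, worst first.

    route_stats maps a directed route (tuple of port ids in hop order)
    to a (sent, dropped) pair of MAD counts. Both thresholds are
    illustrative placeholders, not recommended defaults.
    """
    remaining = dict(route_stats)
    suspects = []
    while remaining:
        sent = defaultdict(int)
        dropped = defaultdict(int)
        # Charge every route's counters to each port the route crosses.
        for route, (n_sent, n_dropped) in remaining.items():
            for port in set(route):
                sent[port] += n_sent
                dropped[port] += n_dropped
        candidates = [
            (dropped[p] / sent[p], sent[p], p)
            for p in sent
            if sent[p] >= min_sent and dropped[p] / sent[p] > max_drop_rate
        ]
        if not candidates:
            break
        # Worst drop rate wins; ties go to the port with more samples.
        worst = max(candidates)[2]
        suspects.append(worst)
        # Routes through the suspect are explained; drop them and re-check.
        remaining = {r: s for r, s in remaining.items() if worst not in r}
    return suspects


# Hypothetical fabric: port F is flaky, T is also reachable through C.
stats = {
    ("A", "B"): (100, 0),
    ("A", "F", "T"): (100, 60),
    ("A", "C", "T"): (100, 1),
    ("A", "F", "D"): (50, 25),
}
print(suspect_ports(stats))
```

In this made-up example a rule of "mark whatever did not answer" would
disqualify T and D, while the aggregate view blames only F.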

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL


> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Wednesday, April 13, 2005 12:00 PM
> To: Eitan Zahavi
> Cc: openib-general at openib.org
> Subject: RE: [openib-general] SM Bad Port Handling
> 
> On Wed, 2005-04-13 at 01:28, Eitan Zahavi wrote:
> > [EZ] This is true. Currently there is only one cause for the
> > un-healthy bits to be set, which is exactly as you point out: these
> > traps. The point I was trying to make was that this bit is the
> > mechanism for flagging that a port's status is bad.
> >
> > What I did recommend was to write a "statistical" analysis of Directed
> > Route packet drop - such that we can find the ports with a high drop
> > rate and mark them as un-healthy. If you mark every port that does not
> > respond to a MAD as un-healthy you can suffer from flaky links
> > somewhere on the route to that port. Only analysis of the number of
> > good packets vs. dropped packets can lead you to the right bad port.
> 
> The original proposal on this said the following:
> 
> "The OpenSM will implement a configurable policy (some number of
> consecutive lack of responses to SM requests). At the point of
> exhaustion of the timeout/retry strategy, that port will be marked as
> "bad" by OpenSM."
> 
> Any idea on what might make a good default threshold (for consecutive
> retries) ? Do you think there is no sufficient default ?
> 
> If a link is flaky and MADs can't get through, should it be used for non
> MAD traffic ?
> 
> Also note that the proposal also said:
> 
> "Also, there could also be a periodic "ping" at a slower rate to check
> if the "bad" ports revive."
> 
> In terms of analysis of good v. errored and dropped packets (along the
> path to that node), there are OpenIB diagnostic tools to help with this.
> 
> -- Hal
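
The policy quoted above (a configurable number of consecutive missing
responses, plus a slower ping to see whether a "bad" port revives) could be
sketched roughly as follows. The class name and the threshold of 3 are
hypothetical; the thread leaves the right default as an open question.

```python
class PortHealth:
    """Counts consecutive unanswered SM requests for one port (hypothetical)."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # placeholder value, not a proposed default
        self.misses = 0
        self.bad = False

    def record(self, responded):
        """Record the outcome of one request (or one slow ping); return bad flag."""
        if responded:
            # Any reply, including a reply to the periodic ping, revives the port.
            self.misses = 0
            self.bad = False
        else:
            self.misses += 1
            if self.misses >= self.threshold:
                self.bad = True
        return self.bad


port = PortHealth(threshold=3)
for responded in (False, False, False, True):
    print(responded, port.record(responded))
```

Note that this counter alone cannot tell whether the silent port or a link
on the route to it is at fault, which is the objection raised in the reply.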