<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=US-ASCII">
<META NAME="Generator" CONTENT="MS Exchange Server version 5.5.2654.45">
<TITLE>RE: [openib-general] SM Bad Port Handling</TITLE>
</HEAD>
<BODY>
<P><FONT SIZE=2>I probably did not make my point very clear:</FONT>
</P>
<P><FONT SIZE=2>It is bad (not to say wrong) to disqualify a port and mark it as a bad port merely because it did not respond to queries.</FONT>
<BR><FONT SIZE=2>The cause might instead be a flaky link somewhere on the directed route to that port.</FONT>
<BR><FONT SIZE=2>If the SM were able to find the port with the flaky link, it would avoid marking the wrong ports. Moreover, the port that the simplistic algorithm you propose would have marked as bad will still be discovered and operational, since there are many other paths that reach it by going around the real bad port!</FONT></P>
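<P><FONT SIZE=2>To illustrate the kind of "statistical" analysis meant here, below is a minimal sketch in Python (it is not OpenSM code). It assumes the SM records, for every directed-route query, the list of ports the route traverses and whether a response came back. The function names, port names, sample data and the two thresholds (min_sent, threshold) are all hypothetical.</FONT></P>
<PRE>
```python
from collections import defaultdict


def drop_rates(samples):
    """Return {port: (sent, dropped)} for every port seen on any route.

    samples: iterable of (route, answered) pairs, where route is the list of
    port names a directed-route query traversed, ending at the target port.
    """
    sent = defaultdict(int)
    dropped = defaultdict(int)
    for route, answered in samples:
        for port in route:
            sent[port] += 1
            if not answered:
                dropped[port] += 1
    return {port: (sent[port], dropped[port]) for port in sent}


def suspect_ports(samples, min_sent=4, threshold=0.5):
    """Ports with enough samples and a drop rate of at least threshold, worst first."""
    stats = drop_rates(samples)
    rated = [
        (dropped / sent, port)
        for port, (sent, dropped) in stats.items()
        if sent >= min_sent and dropped / sent >= threshold
    ]
    return [port for rate, port in sorted(rated, reverse=True)]


# Made-up observations: sw1/p3 is the flaky link, sw2/p5 is a healthy
# alternative path to the same two hosts.
SAMPLES = (
    [(["sw1/p3", "hostA"], False)] * 3
    + [(["sw1/p3", "hostB"], False)] * 2
    + [(["sw1/p3", "hostB"], True)]
    + [(["sw2/p5", "hostA"], True)] * 4
    + [(["sw2/p5", "hostB"], True)] * 3
)

print(suspect_ports(SAMPLES))
```
</PRE>
<P><FONT SIZE=2>With this made-up data, hostA misses three queries in a row through sw1/p3, so a rule based on consecutive missing responses could mark hostA as bad. The per-port drop rates instead single out sw1/p3 and leave hostA usable through sw2/p5.</FONT></P>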
<P><FONT SIZE=2>Eitan Zahavi</FONT>
<BR><FONT SIZE=2>Design Technology Director</FONT>
<BR><FONT SIZE=2>Mellanox Technologies LTD</FONT>
<BR><FONT SIZE=2>Tel:+972-4-9097208</FONT>
<BR><FONT SIZE=2>Fax:+972-4-9593245</FONT>
<BR><FONT SIZE=2>P.O. Box 586 Yokneam 20692 ISRAEL</FONT>
</P>
<BR>
<P><FONT SIZE=2>> -----Original Message-----</FONT>
<BR><FONT SIZE=2>> From: Hal Rosenstock [<A HREF="mailto:halr@voltaire.com">mailto:halr@voltaire.com</A>]</FONT>
<BR><FONT SIZE=2>> Sent: Wednesday, April 13, 2005 12:00 PM</FONT>
<BR><FONT SIZE=2>> To: Eitan Zahavi</FONT>
<BR><FONT SIZE=2>> Cc: openib-general@openib.org</FONT>
<BR><FONT SIZE=2>> Subject: RE: [openib-general] SM Bad Port Handling</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> On Wed, 2005-04-13 at 01:28, Eitan Zahavi wrote:</FONT>
<BR><FONT SIZE=2>> > [EZ] This is true. Currently there is only one cause for the</FONT>
<BR><FONT SIZE=2>> > un-healthy bits to be set - which are exactly as you point - these</FONT>
<BR><FONT SIZE=2>> > traps. The point I was trying to make was that this bit is the</FONT>
<BR><FONT SIZE=2>> > mechanism for flagging a port status is bad.</FONT>
<BR><FONT SIZE=2>> ></FONT>
<BR><FONT SIZE=2>> > What I did recommend was to write a "statistical" analysis of Directed</FONT>
<BR><FONT SIZE=2>> > Route packet drop - such that we can find the ports with a high drop</FONT>
<BR><FONT SIZE=2>> > rate and mark them as un-healthy. If you mark every port that does not</FONT>
<BR><FONT SIZE=2>> > respond to a MAD as un-healthy you can suffer from flaky links</FONT>
<BR><FONT SIZE=2>> > somewhere on the route to that port. Only analysis of the number of</FONT>
<BR><FONT SIZE=2>> > good packets vs. dropped packets can lead you to the right bad port.</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> The original proposal on this said the following:</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> "The OpenSM will implement a configurable policy (some number of</FONT>
<BR><FONT SIZE=2>> consecutive lack of responses to SM requests). At the point of</FONT>
<BR><FONT SIZE=2>> exhaustion of the timeout/retry strategy, that port will be marked as</FONT>
<BR><FONT SIZE=2>> "bad" by OpenSM."</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> Any idea on what might make a good default threshold (for consecutive</FONT>
<BR><FONT SIZE=2>> retries) ? Do you think there is no sufficient default ?</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> If a link is flaky and MADs can't get through, should it be used for non</FONT>
<BR><FONT SIZE=2>> MAD traffic ?</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> Also note that the proposal also said:</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> "Also, there could also be a periodic "ping" at a slower rate to check</FONT>
<BR><FONT SIZE=2>> if the "bad" ports revive."</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> In terms of analysis of good v. errored and dropped packets (along the</FONT>
<BR><FONT SIZE=2>> path to that node), there are OpenIB diagnostic tools to help with this.</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> -- Hal</FONT>
</P>
</BODY>
</HTML>