[ofa-general] XmtDiscards
Hal Rosenstock
hrosenstock at xsigo.com
Sat Apr 5 06:19:43 PDT 2008
Hi Bernd,
On Sat, 2008-04-05 at 00:12 +0200, Bernd Schubert wrote:
> Hello,
>
> after I upgraded one of our clusters to opensm-3.2.1 it seems to have gotten
> much better there, at least no further RcvSwRelayErrors, even when the
> cluster is in idle state and so far also no SymbolErrors, which we also have
> seen before.
>
> However, after I just started a lustre stress test on 50 clients (to a lustre
> storage system with 20 OSS servers and 60 OSTs), ibcheckerrors reports about
> 9000 XmtDiscards within 30 minutes.
>
> Searching for this error I find "This is a symptom of congestion and may
> require tweaking either HOQ or switch lifetime values".
> Well, I have to admit I neither know what HOQ is, nor do I know how to tweak
> it. I also have no idea how to set switch lifetime values. I guess this
> isn't related to the opensm timeout option, is it?
>
> Hmm, I just found a Cisco PDF describing how to set the lifetime on these
> switches, but is this also possible on Flextronics switches?
What routing algorithm are you using? Rather than play with those
switch values, if you are not using up/down, could you try that to see
if it helps with the congestion you are seeing?
-- Hal
> Thanks for any help,
> Bernd
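
For readers unfamiliar with the values discussed above: HOQ stands for head-of-queue lifetime, and both it and the switch lifetime are, as far as I understand the InfiniBand encoding, stored as a small exponent rather than a time. The sketch below is not from the thread. It assumes the usual interpretation (4.096 usec * 2^exponent, with exponents above 19 meaning "no limit") and uses hypothetical helper names, lifetime_seconds and discard_rate, to put the reported counter into perspective. Please check the encoding against the opensm man page or your switch documentation before relying on it.

```python
# Minimal sketch, not taken from the thread. Assumption: lifetime values are
# encoded as an exponent, time = 4.096 usec * 2**exponent, and exponents
# above 19 mean the packet is never discarded for age.

BASE_SECONDS = 4.096e-6  # assumed base unit of the lifetime encoding


def lifetime_seconds(exponent: int) -> float:
    """Convert a lifetime exponent to seconds (hypothetical helper)."""
    if exponent < 0:
        raise ValueError("exponent must be non-negative")
    if exponent > 19:
        return float("inf")  # assumed to mean "no lifetime limit"
    return BASE_SECONDS * (2 ** exponent)


def discard_rate(discards: int, minutes: float) -> float:
    """Average discards per second over a measurement window."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return discards / (minutes * 60.0)


if __name__ == "__main__":
    # Figures from the report above: about 9000 XmtDiscards in 30 minutes.
    print(f"rate: {discard_rate(9000, 30):.1f} discards/s")
    # 0x12 is only an example exponent, not a recommendation.
    print(f"exponent 0x12: {lifetime_seconds(0x12):.3f} s")
```

Under these assumptions, 9000 discards in 30 minutes averages 5 per second across the fabric, and an exponent of 0x12 corresponds to roughly 1.07 seconds.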