[ofa-general] XmtDiscards

Hal Rosenstock hrosenstock at xsigo.com
Sat Apr 5 06:19:43 PDT 2008


Hi Bernd,

On Sat, 2008-04-05 at 00:12 +0200, Bernd Schubert wrote:
> Hello,
> 
> after I upgraded one of our clusters to opensm-3.2.1 it seems to have gotten 
> much better there, at least no further RcvSwRelayErrors, even when the 
> cluster is in idle state and so far also no SymbolErrors, which we also have 
> seen before.
> 
> However, after I just started a lustre stress test on 50 clients (to a lustre 
> storage system with 20 OSS servers and 60 OSTs), ibcheckerrors reports about 
> 9000 XmtDiscards within 30 minutes.
> 
> Searching for this error I find "This is a symptom of congestion and may 
> require tweaking either HOQ or switch lifetime values". 
> Well, I have to admit I neither know what HOQ is, nor do I know how to tweak 
> it. I also have no idea how to set switch lifetime values.  I guess this 
> isn't related to the opensm timeout option, is it?
> 
> Hmm, I just found a Cisco pdf describing how to set the lifetime on these 
> switches, but is this also possible on Flextronics switches?

What routing algorithm are you using? Rather than play with those
switch values, if you are not using up/down, could you try that to see
if it helps with the congestion you are seeing?
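
For reference, HOQ is the head-of-queue lifetime (HOQLife in PortInfo),
and the switch lifetime is LifeTimeValue in SwitchInfo. As far as I
recall from the IBA spec, both are encoded as a 5-bit exponent: the
timeout is 4.096 usec * 2^value, and anything above 19 means no
timeout. A minimal sketch of that conversion follows; the function name
lifetime_us is made up for illustration, and the sample values are only
examples, not recommended settings.

```python
# Sketch: decode an IBA lifetime exponent (HOQLife or LifeTimeValue)
# into a timeout. lifetime_us is a hypothetical helper, not an OFED API.

BASE_US = 4.096  # base time unit in microseconds


def lifetime_us(encoded):
    """Return the timeout in microseconds, or None when it is infinite."""
    if not 0 <= encoded <= 31:
        raise ValueError("encoded lifetime must fit in 5 bits")
    if encoded > 19:
        # Values above 19 are interpreted as "never time out".
        return None
    return BASE_US * (2 ** encoded)


if __name__ == "__main__":
    # Illustrative values only.
    for value in (0x10, 0x12, 0x13, 0x14):
        t = lifetime_us(value)
        label = "infinite" if t is None else "%.1f ms" % (t / 1000.0)
        print(hex(value), label)
```

A packet that waits longer than this is dropped and counted as an
XmtDiscard, so a larger value trades fewer discards for longer stalls
under congestion.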

-- Hal

> Thanks for any help,
> Bernd



