[ofa-general] XmtDiscards

Bernd Schubert bs at q-leap.de
Fri Apr 4 15:12:39 PDT 2008


Hello,

after I upgraded one of our clusters to opensm-3.2.1, things seem to have 
gotten much better there: no further RcvSwRelayErrors, even when the cluster 
is idle, and so far also no SymbolErrors, which we had also seen before.

However, after I just started a lustre stress test on 50 clients (to a lustre 
storage system with 20 OSS servers and 60 OSTs), ibcheckerrors reports about 
9000 XmtDiscards within 30 minutes.

Searching for this error I find "This is a symptom of congestion and may 
require tweaking either HOQ or switch lifetime values". 
Well, I have to admit I neither know what HOQ is, nor do I know how to tweak 
it. I also have no idea how to set switch lifetime values. I guess this 
isn't related to the opensm timeout option, is it?
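
For reference, a minimal sketch of how these values are encoded, as far as I 
can tell from the InfiniBand specification: both HOQLife (head-of-queue 
lifetime, per port) and the switch LifeTimeValue are small integers, and the 
actual timeout is 4.096 us * 2^value, with any value above 19 meaning 
"infinite" (packets are never discarded on timeout). The function name and 
the sample values below are purely illustrative and are not taken from any 
switch or opensm configuration.

```python
import math

BASE_US = 4.096        # base time unit in microseconds
MAX_FINITE = 19        # encoded values above this mean "no timeout"


def lifetime_us(encoded: int) -> float:
    """Decode a 5-bit HOQLife / LifeTimeValue into microseconds.

    Returns math.inf for encoded values above 19.
    """
    if not 0 <= encoded <= 31:
        raise ValueError("encoded lifetime must fit in 5 bits (0..31)")
    if encoded > MAX_FINITE:
        return math.inf
    return BASE_US * (2 ** encoded)


if __name__ == "__main__":
    # Sample encoded values, chosen only for illustration.
    for value in (0x0C, 0x10, 0x12, 0x14):
        t = lifetime_us(value)
        shown = "infinite" if math.isinf(t) else f"{t / 1000.0:.3f} ms"
        print(f"0x{value:02x} -> {shown}")
```

Each step up in the encoded value doubles the time a packet may wait before 
the switch drops it (and counts an XmtDiscard). I believe opensm exposes 
these settings in its options file as head_of_queue_lifetime, 
leaf_head_of_queue_lifetime and packet_life_time, but please check this 
against the documentation of the installed version.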

Hmm, I just found a Cisco PDF describing how to set the lifetime on their 
switches, but is this also possible on Flextronics switches?


Thanks for any help,
Bernd

-- 
Bernd Schubert
Q-Leap Networks GmbH


