[ofa-general] XmtDiscards
Ira Weiny
weiny2 at llnl.gov
Fri Apr 4 15:29:32 PDT 2008
On Sat, 5 Apr 2008 00:12:39 +0200
Bernd Schubert <bs at q-leap.de> wrote:
> Hello,
>
> after I upgraded one of our clusters to opensm-3.2.1 it seems to have gotten
> much better there, at least no further RcvSwRelayErrors, even when the
> cluster is in idle state and so far also no SymbolErrors, which we also have
> seen before.
>
> However, after I just started a lustre stress test on 50 clients (to a lustre
> storage system with 20 OSS servers and 60 OSTs), ibcheckerrors reports about
> 9000 XmtDiscards within 30 minutes.
Yeah, those are bad.
>
> Searching for this error I find "This is a symptom of congestion and may
> require tweaking either HOQ or switch lifetime values".
> Well, I have to admit I neither know what HOQ is, nor do I know how to tweak
> it. I also do not have an idea to set switch lifetime values. I guess this
> isn't related to the opensm timeout option, is it?
Yes, you should adjust these values.
>
> Hmm, I just found a Cisco PDF describing how to set the lifetime on these
> switches, but is this also possible on Flextronics switches?
>
I don't know about the vendor SMs, but in opensm look for the following options
in the opensm.opts file (default path: /var/cache/opensm):
# The code of maximal time a packet can wait at the head of
# transmission queue.
# The actual time is 4.096usec * 2^<head_of_queue_lifetime>
# The value 0x14 disables this mechanism
head_of_queue_lifetime 0x12
# The maximal time a packet can wait at the head of queue on
# switch port connected to a CA or router port
leaf_head_of_queue_lifetime 0x0c
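
To see what those codes mean in wall-clock terms, here is a minimal sketch that
applies the formula from the comment above (4.096usec * 2^<code>) to the two
values shown. The function name hoq_lifetime_usec is made up for illustration
and is not part of opensm. I am also assuming the same formula applies to
leaf_head_of_queue_lifetime, since the file only spells it out for
head_of_queue_lifetime, and the sketch only models codes from 0 up to 0x14.

```python
# Sketch only: converts an opensm.opts head-of-queue lifetime code to time.
# Formula taken from the opensm.opts comment: 4.096 usec * 2^<code>.
BASE_USEC = 4.096
DISABLED_CODE = 0x14  # per the opensm.opts comment, this value disables the mechanism


def hoq_lifetime_usec(code):
    """Return the lifetime in microseconds, or None if the code disables it.

    Hypothetical helper, not an opensm API. Codes above 0x14 are not
    described in the opts file and are not modeled here.
    """
    if code == DISABLED_CODE:
        return None
    return BASE_USEC * (2 ** code)


if __name__ == "__main__":
    for name, code in (
        ("head_of_queue_lifetime", 0x12),
        ("leaf_head_of_queue_lifetime", 0x0C),
    ):
        usec = hoq_lifetime_usec(code)
        print(f"{name} {code:#04x}: {usec / 1000:.3f} ms")
```

With the values above this works out to roughly 1.07 seconds for 0x12 and
roughly 16.8 milliseconds for 0x0c. Each step of 1 in the code doubles or
halves the time.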
Ira