[ofa-general] XmtDiscards

Ira Weiny weiny2 at llnl.gov
Fri Apr 4 15:29:32 PDT 2008


On Sat, 5 Apr 2008 00:12:39 +0200
Bernd Schubert <bs at q-leap.de> wrote:

> Hello,
> 
> after I upgraded one of our clusters to opensm-3.2.1 it seems to have gotten
> much better there, at least no further RcvSwRelayErrors, even when the
> cluster is idle, and so far also no SymbolErrors, which we had also
> seen before.
> 
> However, after I just started a lustre stress test on 50 clients (to a lustre 
> storage system with 20 OSS servers and 60 OSTs), ibcheckerrors reports about 
> 9000 XmtDiscards within 30 minutes.

Yea, those are bad.

> 
> Searching for this error I find "This is a symptom of congestion and may
> require tweaking either HOQ or switch lifetime values".
> Well, I have to admit I neither know what HOQ is, nor do I know how to tweak
> it. I also have no idea how to set switch lifetime values. I guess this
> isn't related to the opensm timeout option, is it?

Yes, you should adjust these values.

> 
> Hmm, I just found a Cisco PDF describing how to set the lifetime on these
> switches, but is this also possible on Flextronics switches?
> 

I don't know about the vendor SMs, but in opensm look for the following options
in the opensm.opts file (the default path is /var/cache/opensm):

   # The code of maximal time a packet can wait at the head of
   # transmission queue.
   # The actual time is 4.096usec * 2^<head_of_queue_lifetime>
   # The value 0x14 disables this mechanism
   head_of_queue_lifetime 0x12
   
   # The maximal time a packet can wait at the head of queue on
   # switch port connected to a CA or router port
   leaf_head_of_queue_lifetime 0x0c
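
To get a feel for what those codes mean in wall-clock time, here is a rough
sketch in Python that applies the formula from the comment above
(4.096usec * 2^<code>). The function and constant names are made up for
illustration and are not part of opensm. It assumes the leaf option uses the
same encoding, and that any code at or above 0x14 counts as "disabled"; the
comment only states this for 0x14 itself.

```python
# Illustrative helper only; these names are hypothetical, not opensm APIs.
HOQ_BASE_USEC = 4.096   # base unit taken from the opensm.opts comment
HOQ_DISABLED = 0x14     # the comment says this value disables the mechanism


def hoq_lifetime_usec(code):
    """Return the head-of-queue lifetime in microseconds.

    Returns None when the mechanism is disabled. Treating codes above
    0x14 as disabled too is an assumption.
    """
    if code < 0:
        raise ValueError("lifetime code must be non-negative")
    if code >= HOQ_DISABLED:
        return None
    return HOQ_BASE_USEC * (2 ** code)


if __name__ == "__main__":
    # Values shown in the opensm.opts excerpt above.
    options = (
        ("head_of_queue_lifetime", 0x12),
        ("leaf_head_of_queue_lifetime", 0x0c),
    )
    for name, code in options:
        usec = hoq_lifetime_usec(code)
        if usec is None:
            print("%s 0x%02x: disabled" % (name, code))
        else:
            print("%s 0x%02x: %.3f msec" % (name, code, usec / 1000.0))
```

With the values shown above, 0x12 works out to roughly 1.07 seconds and 0x0c
to roughly 16.8 milliseconds. Each step of one in the code doubles or halves
the lifetime.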

Ira


