[ofa-general] XmtDiscards

Fri Apr 4 16:45:47 PDT 2008

On Fri, Apr 04, 2008 at 03:29:32PM -0700, Ira Weiny wrote:
> On Sat, 5 Apr 2008 00:12:39 +0200
> Bernd Schubert <bs at q-leap.de> wrote:
> 
> > Hello,
> > 
> > after I upgraded one of our clusters to opensm-3.2.1 it seems to have gotten 
> > much better there, at least no further RcvSwRelayErrors, even when the 
> > cluster is in idle state and so far also no SymbolErrors, which we also have 
> > seen before.
> > 
> > However, after I just started a lustre stress test on 50 clients (to a lustre 
> > storage system with 20 OSS servers and 60 OSTs), ibcheckerrors reports about 
> > 9000 XmtDiscards within 30 minutes.
> 
> Yea, those are bad.
> 
> > 
> > Searching for this error I find "This is a symptom of congestion and may 
> > require tweaking either HOQ or switch lifetime values". 
> > Well, I have to admit I neither know what HOQ is nor how to tweak 
> > it. I also have no idea how to set switch lifetime values.  I guess this 
> > isn't related to the opensm timeout option, is it?
> 
> Yes, you should adjust these values.
> 
> > 
> > Hmm, I just found a Cisco PDF describing how to set the lifetime on these 
> > switches, but is this also possible on Flextronics switches?
> > 
> 
> I don't know about the Vendor SMs but in opensm look for the following options
> in the opensm.opts file (Default path is: /var/cache/opensm):
> 
>    # The code of maximal time a packet can wait at the head of
>    # transmission queue.
>    # The actual time is 4.096usec * 2^<head_of_queue_lifetime>
>    # The value 0x14 disables this mechanism
>    head_of_queue_lifetime 0x12
>    
>    # The maximal time a packet can wait at the head of queue on
>    # switch port connected to a CA or router port
>    leaf_head_of_queue_lifetime 0x0c

Hmm, I first increased head_of_queue_lifetime to 0x13 and
leaf_head_of_queue_lifetime to 0x20, but this didn't make the errors
go away. So I increased head_of_queue_lifetime to 0x15 and
leaf_head_of_queue_lifetime to 0x50, but this made the fabric crash
entirely. On the node running the master opensm I got an endless stream of
messages like these:

Apr  5 01:35:03 pfs1n2 kernel: [705448.344542] NETDEV WATCHDOG: ib0: transmit timed out
Apr  5 01:35:03 pfs1n2 kernel: [705448.349814] ib0: transmit timeout: latency 411908 msecs
Apr  5 01:35:03 pfs1n2 kernel: [705448.355364] ib0: queue stopped 1, tx_head 441, tx_tail 377
Apr  5 01:35:04 pfs1n2 kernel: [705449.343495] NETDEV WATCHDOG: ib0: transmit timed out

The slave opensm also went into D-state and is not killable anymore :(

Seems I have to be very careful with these settings...
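
For reference, here is a minimal sketch of the arithmetic behind the quoted
opensm.opts comment (actual time = 4.096usec * 2^<head_of_queue_lifetime>,
with 0x14 disabling the mechanism). The function name hoq_lifetime_usec is
made up for illustration, and treating codes above 0x14 the same as 0x14 is
an assumption on my part; the quoted comment only mentions 0x14 itself.

```python
# Sketch only: converts a head_of_queue_lifetime code into a timeout using
# the formula from the opensm.opts comment: 4.096 usec * 2^<code>.
BASE_USEC = 4.096
DISABLE_CODE = 0x14  # per the opensm.opts comment, this value disables the mechanism


def hoq_lifetime_usec(code):
    """Return the timeout in microseconds, or None if the mechanism is disabled.

    Assumption: codes above 0x14 are treated like 0x14 (no timeout).
    """
    if code < 0:
        raise ValueError("lifetime code must be non-negative")
    if code >= DISABLE_CODE:
        return None
    return BASE_USEC * (2 ** code)


if __name__ == "__main__":
    # Values mentioned in this thread.
    settings = [
        ("head_of_queue_lifetime (default)", 0x12),
        ("leaf_head_of_queue_lifetime (default)", 0x0C),
        ("head_of_queue_lifetime (first try)", 0x13),
        ("leaf_head_of_queue_lifetime (first try)", 0x20),
        ("head_of_queue_lifetime (second try)", 0x15),
        ("leaf_head_of_queue_lifetime (second try)", 0x50),
    ]
    for name, code in settings:
        usec = hoq_lifetime_usec(code)
        if usec is None:
            print(f"{name} = {code:#04x}: no timeout (mechanism disabled)")
        else:
            print(f"{name} = {code:#04x}: {usec / 1000.0:.3f} ms")
```

Under that reading, the defaults come out at roughly 1.07 s (0x12) and
16.8 ms (0x0c), 0x13 at roughly 2.15 s, and every other value I tried is at
or beyond the 0x14 "disabled" code.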

Thanks for your help,
Bernd