[ofa-general] XmtDiscards

Hal Rosenstock hrosenstock at xsigo.com
Sat Apr 5 06:17:59 PDT 2008


On Fri, 2008-04-04 at 17:48 -0700, Boris Shpolyansky wrote:
> Bernd,
> 
> 0x14 is the maximal value for HOQ lifetime, which effectively disables
> the mechanism. I think you shouldn't exceed this value. 

True about the maximal value, but any 5-bit value > 19 (up through 31)
should effectively be the same thing according to the spec.
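
To make the arithmetic concrete, here is a minimal Python sketch of the
formula from the opensm.opts comment quoted below (4.096 usec * 2^value),
treating any value above 19 as "no limit". The function name
hoq_lifetime_usec is made up for illustration and is not part of OpenSM.

```python
HOQ_UNIT_USEC = 4.096   # base unit from the opensm.opts comment
HOQ_LAST_FINITE = 19    # values above this are treated as "no limit"


def hoq_lifetime_usec(code):
    """Return the head-of-queue lifetime in microseconds for a 5-bit
    code, or None when the code means the mechanism is disabled."""
    if not 0 <= code <= 31:
        raise ValueError("HOQ lifetime code must fit in 5 bits (0..31)")
    if code > HOQ_LAST_FINITE:
        return None  # 20 (0x14) through 31 all mean "no limit"
    return HOQ_UNIT_USEC * 2 ** code


if __name__ == "__main__":
    for code in (0x0C, 0x12, 0x13, 0x14):
        print(hex(code), hoq_lifetime_usec(code))
```

With the defaults from the quoted opensm.opts, 0x12 works out to roughly
1.07 seconds and 0x0c to roughly 16.8 milliseconds.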

I also think that OpenSM could do a better job validating and setting
this and other similar optional parameters.
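
As a rough illustration of the kind of check meant here, the Python
sketch below rejects values that cannot fit in a 5-bit field and folds
every "no limit" value to 0x14. This is not what OpenSM actually does;
check_hoq_code and its behavior are assumptions made up for this example.

```python
HOQ_MAX_FIELD = 0x1F   # largest value a 5-bit field can hold
HOQ_DISABLED = 0x14    # value the opensm.opts comment documents as "disabled"


def check_hoq_code(name, text):
    """Parse an option value such as '0x12' and reject anything that
    cannot be stored in a 5-bit field (hypothetical helper)."""
    value = int(text, 0)  # base 0 accepts both '0x12' and '18'
    if not 0 <= value <= HOQ_MAX_FIELD:
        raise ValueError(
            f"{name}={text} does not fit in 5 bits (0x00..0x1f)"
        )
    # Everything above 19 means "no limit"; fold it to the documented 0x14.
    return min(value, HOQ_DISABLED)


if __name__ == "__main__":
    print(hex(check_hoq_code("head_of_queue_lifetime", "0x15")))
    try:
        check_hoq_code("leaf_head_of_queue_lifetime", "0x50")
    except ValueError as err:
        print("rejected:", err)
```

Under this check, the 0x20 and 0x50 settings from the quoted message
would be refused up front, since neither fits in 5 bits.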

-- Hal

> Boris
> 
> -----Original Message-----
> From: general-bounces at lists.openfabrics.org
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Bernd
> Schubert
> Sent: Friday, April 04, 2008 4:46 PM
> To: Ira Weiny
> Cc: general at lists.openfabrics.org
> Subject: Re: [ofa-general] XmtDiscards
> 
> On Fri, Apr 04, 2008 at 03:29:32PM -0700, Ira Weiny wrote:
> > On Sat, 5 Apr 2008 00:12:39 +0200
> > Bernd Schubert <bs at q-leap.de> wrote:
> > 
> > > Hello,
> > > 
> > > after I upgraded one of our clusters to opensm-3.2.1 it seems to
> > > have gotten much better there, at least no further RcvSwRelayErrors,
> > > even when the cluster is in idle state and so far also no
> > > SymbolErrors, which we also have seen before.
> > > 
> > > However, after I just started a lustre stress test on 50 clients (to
> > > a lustre storage system with 20 OSS servers and 60 OSTs),
> > > ibcheckerrors reports about 9000 XmtDiscards within 30 minutes.
> > 
> > Yea, those are bad.
> > 
> > > 
> > > Searching for this error I find "This is a symptom of congestion and
> > > may require tweaking either HOQ or switch lifetime values".
> > > Well, I have to admit I neither know what HOQ is, nor do I know how
> > > to tweak it. I also have no idea how to set switch lifetime
> > > values.  I guess this isn't related to the opensm timeout option, is it?
> > 
> > Yes you should adjust these values.
> > 
> > > 
> > > Hmm, I just found a Cisco PDF describing how to set the lifetime on
> > > these switches, but is this also possible on Flextronics switches?
> > > 
> > 
> > I don't know about the Vendor SMs but in opensm look for the following
> > options in the opensm.opts file (Default path is: /var/cache/opensm):
> > 
> >    # The code of maximal time a packet can wait at the head of
> >    # transmission queue.
> >    # The actual time is 4.096usec * 2^<head_of_queue_lifetime>
> >    # The value 0x14 disables this mechanism
> >    head_of_queue_lifetime 0x12
> >    
> >    # The maximal time a packet can wait at the head of queue on
> >    # switch port connected to a CA or router port
> >    leaf_head_of_queue_lifetime 0x0c
> 
> Hmm, I first increased head_of_queue_lifetime to 0x13 and
> leaf_head_of_queue_lifetime to 0x20, but this didn't make the error go
> away. So I increased head_of_queue_lifetime to 0x15 and
> leaf_head_of_queue_lifetime to 0x50, but this made the fabric crash
> entirely. On the node of the master opensm I got an endless number
> of messages like these:
> 
> Apr  5 01:35:03 pfs1n2 kernel: [705448.344542] NETDEV WATCHDOG: ib0: transmit timed out
> Apr  5 01:35:03 pfs1n2 kernel: [705448.349814] ib0: transmit timeout: latency 411908 msecs
> Apr  5 01:35:03 pfs1n2 kernel: [705448.355364] ib0: queue stopped 1, tx_head 441, tx_tail 377
> Apr  5 01:35:04 pfs1n2 kernel: [705449.343495] NETDEV WATCHDOG: ib0: transmit timed out
> 
> The slave opensm also went into D-state and is not killable anymore :(
> 
> Seems I have to be very careful with these settings...
> 
> 
> Thanks for your help,
> Bernd
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general



