[ofa-general] XmtDiscards

Bernd Schubert bs at q-leap.de
Sun Apr 6 09:09:41 PDT 2008


Hello Sasha,

On Sun, Apr 06, 2008 at 06:53:14AM +0000, Sasha Khapyorsky wrote:
> On 01:45 Sat 05 Apr     , Bernd Schubert wrote:
> > 
> > Hmm, I first increased head_of_queue_lifetime to 0x13 and 
> > leaf_head_of_queue_lifetime to 0x20, but this didn't make the error 
> > go away. So I increased head_of_queue_lifetime to 0x15 and 
> > leaf_head_of_queue_lifetime to 0x50, but this made the fabric crash
> > entirely.
> 
> Are you using default (min hops) routing? I think it could be deadlock
> due to unlimited head_of_queue_lifetime values.
> 
> > On the node of the master opensm I got an endless number of messages
> > like these:
> > 
> > Apr  5 01:35:03 pfs1n2 kernel: [705448.344542] NETDEV WATCHDOG: ib0: transmit timed out
> > Apr  5 01:35:03 pfs1n2 kernel: [705448.349814] ib0: transmit timeout: latency 411908 msecs
> > Apr  5 01:35:03 pfs1n2 kernel: [705448.355364] ib0: queue stopped 1, tx_head 441, tx_tail 377
> > Apr  5 01:35:04 pfs1n2 kernel: [705449.343495] NETDEV WATCHDOG: ib0: transmit timed out
> > 
> > The slave opensm also went into D-state and is not killable anymore :(
> 
> Interesting... Any more details about this?

Unfortunately not. As you can see from the timestamps, it was already rather
late and I just wanted to get the entire system working again, so I rebooted
both nodes running the opensms :(
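
For reference, here is a rough sketch of how the lifetime codes quoted above
map to real timeouts. It assumes the encoding described in the opensm option
comments, namely that the actual time is 4.096 usec * 2^code and that codes
of 0x14 and above disable the timer (the "unlimited" case Sasha mentions).
The function name and the threshold constant are only illustrative, they are
not taken from opensm itself.

```python
import math

# Assumption: lifetime = 4.096 usec * 2^code, as in the opensm option comments.
BASE_USEC = 4.096
# Assumption: codes >= 0x14 disable the timer (unlimited lifetime).
UNLIMITED_FROM = 0x14


def hoq_lifetime_seconds(code):
    """Return the head-of-queue lifetime in seconds for a given code.

    Hypothetical helper for illustration only; returns math.inf when the
    code falls into the assumed "unlimited" range.
    """
    if code >= UNLIMITED_FROM:
        return math.inf
    return BASE_USEC * 1e-6 * 2 ** code


# The values tried in this thread.
attempts = [
    ("head_of_queue_lifetime", 0x13),
    ("leaf_head_of_queue_lifetime", 0x20),
    ("head_of_queue_lifetime", 0x15),
    ("leaf_head_of_queue_lifetime", 0x50),
]

for name, code in attempts:
    seconds = hoq_lifetime_seconds(code)
    if math.isinf(seconds):
        print(f"{name} = {code:#x}: unlimited")
    else:
        print(f"{name} = {code:#x}: {seconds:.3f} s")
```

Under these assumptions only 0x13 is a finite timeout (about 2.1 seconds).
The other three values, including the leaf setting from the first attempt,
would all be in the unlimited range.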


Thanks,
Bernd


