[ofa-general] IPoIB CQ overrun
Eli Cohen
eli at dev.mellanox.co.il
Sun Nov 18 00:27:23 PST 2007
Can you tell how IPoIB is configured - connected mode or datagram mode?
Also, can you send more context from /var/log/messages? In particular,
can you rerun with debugging enabled and send the output?
Debugging can be enabled with:
echo 1 > /sys/module/ib_ipoib/parameters/debug_level
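(To verify the setting, or to turn debugging back off afterwards --
these are just the standard sysfs read/write counterparts of the
command above:)
cat /sys/module/ib_ipoib/parameters/debug_level
echo 0 > /sys/module/ib_ipoib/parameters/debug_level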
On Fri, 2007-11-16 at 17:23 -0600, Chris Elmquist wrote:
> On Thursday (11/15/2007 at 12:23PM -0800), akepner at sgi.com wrote:
> >
> > We have a large (~1800 node) IB cluster of x86_64 machines, and
> > we're having some significant problems with IPoIB.
> >
> > The thing all the IPoIB failures have in common seems to be the
> > appearance of a "CQ overrun" message in syslog, e.g.:
> >
> > ib_mthca 0000:06:00.0: CQ overrun on CQN 180082
> >
> > From there things go badly in different ways - tx_timeouts,
> > oopses, etc. Sometimes things just start working again after
> > a few minutes.
> >
> > The appearance of these failures seems to be well correlated
> > with the size of the machine. I don't think there are any problems
> > until the machine is built up to about its maximum size, and
> > then they become pretty common.
> >
> > We are using MT25204 HCAs with 1.2.0 firmware, and OFED 1.2.
> >
> > Does this ring a bell with anyone?
>
> I can perhaps elaborate a little more on the test case we are using to
> expose this situation...
>
> On 1024 (or more) nodes, nttcp -i is started as a "TCP socket server".
> Eight copies are started, each on a different TCP port (5000 ... 5007).
>
> From a single client node, anywhere from 1024 to 8192 nttcp clients
> are launched toward all 1024 of the others. We can open either one
> connection or eight connections between the client and each node.
> Each nttcp test runs for 120 seconds, and in these scenarios all
> connections get established, nttcp moves data, and nothing fails. We
> get the expected performance.
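>
> (For reference, a rough sketch of the launch pattern. The only nttcp
> flag taken from the description above is "-i"; the "-p" port option
> and the client-side invocation are assumptions, so check them against
> nttcp(1) on your build:)
>
>   # On each of the 1024+ server nodes, one receiver per port:
>   for port in $(seq 5000 5007); do
>       nttcp -i -p $port &
>   done
>
>   # On the single client node, one (or eight) transmitters per server:
>   while read node; do                  # nodelist: one hostname per line
>       for port in $(seq 5000 5007); do
>           nttcp -t -p $port $node &
>       done
>   done < nodelist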
>
> If the node count is increased to 1152, things start to become
> unreliable: we see connections fail to establish when we try to open
> eight per node. If we open one per node, they all establish and run;
> in fact, one per node across 1664 nodes succeeds as well.
>
> So the problem seems to be related to the total number of nodes on
> the fabric as well as how many TCP connections you try to establish to
> each node.
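>
> (Putting rough numbers on it, from the figures above: 1024 nodes x 8
> connections = 8192 total, which works; 1152 x 8 = 9216, which fails;
> 1664 x 1 = 1664, which works. So from these data points alone, the
> breaking point could also be read as a total-connection limit
> somewhere between 8192 and 9216.)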
>
> One is tempted to believe it is a problem at the single node that is
> opening all of these connections to the others... but the failure occurs
> on the nodes being connected to-- the nttcp servers-- with the CQ overrun
> and TX WATCHDOG TIMEOUTS, etc. The final outcome is that we lose all
> TCP connectivity over IB to the affected nodes for some period of time.
> Sometimes they come back, sometimes they don't; sometimes it's seconds
> and sometimes it's minutes before they come back. Not very deterministic.
>
> cje