[ofa-general] IPoIB CQ overrun

Chris Elmquist chrise at sgi.com
Fri Nov 16 15:23:31 PST 2007


On Thursday (11/15/2007 at 12:23PM -0800), akepner at sgi.com wrote:
> 
> We have a large (~1800 node) IB cluster of x86_64 machines, and 
> we're having some significant problems with IPoIB. 
> 
> The thing that all the IPoIB failures have in common seems to be 
> an appearance of a "CQ overrun" in syslog, e.g.:
> 
> ib_mthca 0000:06:00.0: CQ overrun on CQN 180082
> 
> From there things go badly in different ways - tx_timeouts, 
> oopses, etc. Sometimes things just start working again after 
> a few minutes. 
> 
> The appearance of these failures seems to be well correlated 
> with the size of the machine. I don't think there are any problems 
> until the machine is built up to about its maximum size, and 
> then they become pretty common.
> 
> We are using MT25204 HCAs with 1.2.0 firmware, and OFED 1.2. 
> 
> Does this ring a bell with anyone? 

I can perhaps elaborate a little more on the test case we are using to
expose this situation...

On each of 1024 (or more) nodes, nttcp -i is started as a "tcp socket
server".  Eight copies are started per node, each listening on a
different TCP port (5000 ... 5007).
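
(For reference, here is roughly how the servers get launched.  This is
a minimal Python sketch, not our actual job script; it assumes this
nttcp build accepts -i for server mode and -p <port> to pick the
listening port, so check your nttcp version before trusting it.)

import subprocess

BASE_PORT = 5000        # first of the eight ports, 5000 ... 5007
NUM_SERVERS = 8

# Start one nttcp server instance per port on this node.
procs = [subprocess.Popen(["nttcp", "-i", "-p", str(BASE_PORT + i)])
         for i in range(NUM_SERVERS)]

# Keep them running until each exits (or until interrupted).
for p in procs:
    p.wait()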

From one additional client node, as few as 1024 and as many as 8192
nttcp clients are launched to all 1024 of the servers.  We can open
either one connection between the client and each node, or eight.  The
nttcp test runs for 120 seconds, and in these scenarios all connections
get established, nttcp moves data, and nothing fails.  We get the
expected performance.
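
(Again a minimal Python sketch of the client side, under stated
assumptions: the node0000 ... node1023 hostnames are hypothetical, -t
is nttcp's transmit mode, and the 120 second duration is handled
outside this snippet.  Set CONNS_PER_NODE to 1 or 8 to reproduce the
two scenarios.)

import subprocess

nodes = ["node%04d" % n for n in range(1024)]   # hypothetical hostnames
CONNS_PER_NODE = 8      # one connection per node, or eight
BASE_PORT = 5000

# Fan out one nttcp transmitter per (node, port) pair; with 1024 nodes
# and 8 connections each, that is 8192 concurrent clients, so the
# process limit (ulimit -u) on the client node must allow it.
procs = []
for host in nodes:
    for i in range(CONNS_PER_NODE):
        procs.append(subprocess.Popen(
            ["nttcp", "-t", "-p", str(BASE_PORT + i), host]))

failed = sum(1 for p in procs if p.wait() != 0)
print("%d of %d transfers failed" % (failed, len(procs)))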

If the node count is increased to 1152, things start to become
unreliable.  We see connections fail to establish when we try to open
eight per node.  If we open one per node, they all establish and run.
In fact, we can do one per node across 1664 nodes and that succeeds
as well.

So the problem seems to be related both to the total number of nodes on
the fabric and to how many TCP connections you try to establish to each
node.  (For the record: 1024 nodes x 8 connections = 8192 succeed,
1152 x 8 = 9216 fail, and 1664 x 1 = 1664 succeed.)

One is tempted to believe it is a problem on the single node that is
opening all of these connections to the others...  but the failure
occurs on the nodes being connected to (the nttcp servers), with the CQ
overrun, TX watchdog timeouts, etc.  The final outcome is that we lose
all TCP connectivity over IB to the affected nodes for some period of
time.  Sometimes they come back and sometimes they don't; sometimes it's
seconds and sometimes minutes before they come back.  Not very
deterministic.
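
(One way to correlate the failures with the driver messages is simply
to watch the syslog for the lines quoted above.  A minimal Python
sketch follows; the log path and the exact watchdog message text are
assumptions that vary by distribution and kernel.)

import time

LOG = "/var/log/messages"                      # assumed syslog location
PATTERNS = ("CQ overrun", "NETDEV WATCHDOG")   # driver / tx timeout lines

with open(LOG) as f:
    f.seek(0, 2)            # start at end of file, follow like tail -f
    while True:
        line = f.readline()
        if not line:
            time.sleep(1)
            continue
        if any(p in line for p in PATTERNS):
            print(line.rstrip())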

cje
-- 
Chris Elmquist          mailto:chrise at sgi.com      (651)683-3093
                        Silicon Graphics, Inc.     Eagan, MN


