[ofa-general] IPoIB CQ overrun
Eli Cohen
eli at dev.mellanox.co.il
Sun Nov 18 00:27:23 PST 2007
Can you tell how IPoIB is configured - connected mode or datagram mode?
Also, can you send more context from /var/log/messages? In particular,
can you rerun with debugging enabled and send the output?
Debugging can be enabled with:
echo 1 > /sys/module/ib_ipoib/parameters/debug_level
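(To verify the setting, or to turn debugging back off afterwards --
these are just the standard sysfs read/write counterparts of the
command above:)
cat /sys/module/ib_ipoib/parameters/debug_level
echo 0 > /sys/module/ib_ipoib/parameters/debug_level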
On Fri, 2007-11-16 at 17:23 -0600, Chris Elmquist wrote:
> On Thursday (11/15/2007 at 12:23PM -0800), akepner at sgi.com wrote:
> >
> > We have a large (~1800 node) IB cluster of x86_64 machines, and
> > we're having some significant problems with IPoIB.
> >
> > The thing all the IPoIB failures have in common seems to be the
> > appearance of a "CQ overrun" message in syslog, e.g.:
> >
> > ib_mthca 0000:06:00.0: CQ overrun on CQN 180082
> >
> > From there things go badly in different ways - tx_timeouts,
> > oopses, etc. Sometimes things just start working again after
> > a few minutes.
> >
> > The appearance of these failures seems to be well correlated
> > with the size of the machine. I don't think there are any problems
> > until the machine is built up to about its maximum size, and
> > then they become pretty common.
> >
> > We are using MT25204 HCAs with 1.2.0 firmware, and OFED 1.2.
> >
> > Does this ring a bell with anyone?
>
> I can perhaps elaborate a little more on the test case we are using to
> expose this situation...
>
> On 1024 (or more) nodes, nttcp -i is started as a "TCP socket server".
> Eight copies are started, each on a different TCP port (5000 ... 5007).
>
> From a single client node, anywhere from 1024 to 8192 nttcp clients
> are launched toward all 1024 of the others. We can open either one
> connection or eight connections between the client and each node.
> Each nttcp test runs for 120 seconds, and in these scenarios all
> connections get established, nttcp moves data, and nothing fails. We
> get the expected performance.
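>
> (For reference, a rough sketch of the launch pattern. The only nttcp
> flag taken from the description above is "-i"; the "-p" port option
> and the client-side invocation are assumptions, so check them against
> nttcp(1) on your build:)
>
>   # On each of the 1024+ server nodes, one receiver per port:
>   for port in $(seq 5000 5007); do
>       nttcp -i -p $port &
>   done
>
>   # On the single client node, one (or eight) transmitters per server:
>   while read node; do                  # nodelist: one hostname per line
>       for port in $(seq 5000 5007); do
>           nttcp -t -p $port $node &
>       done
>   done < nodelist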
>
> If the node count is increased to 1152, things start to become
> unreliable: we see connections fail to establish when we try to open
> eight per node. If we open one per node, they all establish and run;
> in fact, one per node across 1664 nodes succeeds as well.
>
> So the problem seems to be related to the total number of nodes on
> the fabric as well as how many TCP connections you try to establish to
> each node.
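>
> (Putting rough numbers on it, from the figures above: 1024 nodes x 8
> connections = 8192 total, which works; 1152 x 8 = 9216, which fails;
> 1664 x 1 = 1664, which works. So from these data points alone, the
> breaking point could also be read as a total-connection limit
> somewhere between 8192 and 9216.)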
>
> One is tempted to believe it is a problem at the single node that is
> opening all of these connections to the others... but the failure occurs
> on the nodes being connected to-- the nttcp servers-- with the CQ overrun
> and TX WATCHDOG TIMEOUTS, etc. The final outcome is that we lose all
> TCP connectivity over IB to the affected nodes for some period of time.
> Sometimes they come back, sometimes they don't; sometimes it's seconds
> and sometimes it's minutes before they come back. Not very deterministic.
>
> cje