[ofa-general] IPoIB CQ overrun
akepner at sgi.com
Thu Nov 15 12:23:02 PST 2007
We have a large (~1800 node) IB cluster of x86_64 machines, and
we're having some significant problems with IPoIB.
The thing that all the IPoIB failures have in common seems to be
an appearance of a "CQ overrun" in syslog, e.g.:
ib_mthca 0000:06:00.0: CQ overrun on CQN 180082
From there things go badly in different ways - tx_timeouts,
oopses, etc. Sometimes things just start working again after
a few minutes.
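
For anyone who wants to check how often this shows up on their own nodes, here is a minimal sketch that tallies "CQ overrun" messages per device and CQN from syslog text. It is only an illustration: the regular expression is modeled on the single log line quoted above, and the function name count_cq_overruns is made up for this example. Other driver versions may format the message differently.

```python
import re
import sys
from collections import Counter

# Pattern modeled on the one syslog line quoted in this message (assumption:
# other ib_mthca versions may word or format the message differently).
CQ_OVERRUN = re.compile(
    r"ib_mthca (?P<pci>[0-9a-fA-F:.]+): CQ overrun on CQN (?P<cqn>[0-9a-fA-F]+)"
)


def count_cq_overruns(lines):
    """Return a Counter keyed by (PCI address, CQN) for matching lines."""
    counts = Counter()
    for line in lines:
        match = CQ_OVERRUN.search(line)
        if match:
            counts[(match.group("pci"), match.group("cqn"))] += 1
    return counts


if __name__ == "__main__":
    # Read syslog text from standard input and print one line per (device, CQN).
    for (pci, cqn), n in sorted(count_cq_overruns(sys.stdin).items()):
        print(f"{pci} CQN {cqn}: {n}")
```

Saved under any file name and fed a node's syslog on standard input, it prints one count per device and CQN, which could make it easier to compare nodes.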
The appearance of these failures seems to be well correlated
with the size of the machine. I don't think there are any problems
until the machine is built up to about its maximum size, and
then they become pretty common.
We are using MT25204 HCAs with 1.2.0 firmware, and OFED 1.2.
Does this ring a bell with anyone?
--
Arthur