[ofa-general] IPoIB CQ overrun

akepner at sgi.com
Thu Nov 15 12:23:02 PST 2007


We have a large (~1800 node) IB cluster of x86_64 machines, and 
we're having some significant problems with IPoIB. 

The thing that all the IPoIB failures have in common seems to be 
an appearance of a "CQ overrun" in syslog, e.g.:

ib_mthca 0000:06:00.0: CQ overrun on CQN 180082

From there things go badly in different ways - tx_timeouts, 
oopses, etc. Sometimes things just start working again after 
a few minutes. 

The appearance of these failures seems to be well correlated 
with the size of the machine. I don't think there are any problems 
until the machine is built up to about its maximum size, and 
then they become pretty common.

We are using MT25204 HCAs with 1.2.0 firmware, and OFED 1.2. 

Does this ring a bell with anyone? 

-- 
Arthur



