[ofa-general] IPoIB CQ overrun
Roland Dreier
rdreier at cisco.com
Mon Nov 19 20:29:36 PST 2007
> The thing that all the IPoIB failures have in common seems to be
> an appearance of a "CQ overrun" in syslog, e.g.:
>
> ib_mthca 0000:06:00.0: CQ overrun on CQN 180082
> We are using MT25204 HCAs with 1.2.0 firmware, and OFED 1.2.
OFED 1.2 uses a separate CQ for send completions in connected mode.
(I'm assuming you're using the OFED default of connected mode for
IPoIB). I guess it would be useful to know which CQ is overrunning,
i.e. whether it is the main IPoIB CQ or one of the CM send CQs. One way
to check this would be to add a print to mthca to dump the CQN when a
CQ is created, and also add prints to IPoIB just before each call to
ib_create_cq() so that the CQNs can be correlated.
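Once such prints are in place, matching them against the overrun message can be scripted. The following is only a rough sketch: the "CQ overrun on CQN" format comes from the syslog line quoted above, but the "ipoib: creating ... CQ" and "created CQN" formats are hypothetical stand-ins for whatever text the added prints actually emit, and the sketch assumes each IPoIB print is followed by the mthca print for the CQ it creates. CQNs are compared as opaque strings, so it does not matter whether they are printed in hex or decimal.

```python
import re

# Format taken from the syslog line quoted in this thread.
OVERRUN_RE = re.compile(r"CQ overrun on CQN (\w+)")

# Hypothetical formats for the added debug prints (assumptions, not
# messages that exist in ib_mthca or ib_ipoib as shipped).
IPOIB_RE = re.compile(r"ipoib: creating (?P<label>[\w ]+) CQ")
MTHCA_RE = re.compile(r"ib_mthca .*: created CQN (?P<cqn>\w+)")


def correlate(lines):
    """Map each created CQN to the IPoIB label printed just before it,
    and collect the CQNs that later reported an overrun."""
    labels = {}      # CQN string -> label such as "main" or "CM send"
    overruns = []    # CQN strings seen in "CQ overrun" messages
    pending = None   # label from the most recent IPoIB print
    for line in lines:
        m = IPOIB_RE.search(line)
        if m:
            pending = m.group("label")
            continue
        m = MTHCA_RE.search(line)
        if m:
            labels[m.group("cqn")] = pending or "unknown"
            pending = None
            continue
        m = OVERRUN_RE.search(line)
        if m:
            overruns.append(m.group(1))
    return labels, overruns
```

Feeding the syslog lines to correlate() and looking up each overrunning CQN in the returned mapping shows whether it belongs to the main IPoIB CQ or to a CM send CQ.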
Another thing you could try would be a 2.6.24-rc kernel (or an OFED
1.3 prerelease I guess), which has a change that moves all completions
into one CQ in IPoIB. This may fix the bug by accident.
- R.