[ofa-general] poll CQ failed -2 with connectX

Rick Warner ricklist at microway.com
Tue Oct 28 13:39:02 PDT 2008


Hi Eli,

Thanks for the suggestion.  Unfortunately, I have now reproduced the same
problem on a group of 8 Xeon-based systems as well, so the problem is not
specific to the Opterons.

Thanks,
Rick

On Tuesday 28 October 2008, Eli Cohen wrote:
> On Mon, Oct 27, 2008 at 06:38:48PM -0400, Rick Warner wrote:
> > Hi all,
> >
> > I am configuring an opteron cluster with connectX Infiniband.  I have a
> > problem that if I run one of the NAS tests, it works the first, and maybe
> > 2nd time, but after that the jobs instantly fail with messages like this-
> >
> > [Rank 44][cm.c: line 860]poll CQ failed -2
> > [Rank 51][cm.c: line 860]poll CQ failed -2
> > [Rank 119][cm.c: line 860]poll CQ failed -2
> > [Rank 85][cm.c: line 860]poll CQ failed -2
> > [Rank 0][cm.c: line 860]poll CQ failed -2
> > [Rank 9][cm.c: line 860]poll CQ failed -2
> > [Rank 26][cm.c: line 860]poll CQ failed -2[Rank 43][cm.c: line 860]
> > poll CQ failed -2
> > [Rank 94][cm.c: line 860]poll CQ failed -2
> > [Rank 111][cm.c: line 860]poll CQ failed -2
>
> This error means that a CQE was polled which belongs to a
> nonexistent QP. But, I do remember a case with an Opteron which
> experienced the same problem and eventually it appeared that it was a
> system problem that was resolved after a BIOS update. Can you check if
> there is an update to your system's BIOS?
>
> > I can easily reproduce this with only 2 systems using a 16 process LU
> > job, class B.
> >
> > Here are the configs I've tried:
> > - Suse 11 with the distro-provided IB driver, libraries, etc., using
> >   mvapich as provided by Ohio State
> > - Suse 11 with the distro driver, using OFED 1.3.1 libraries and mvapich
> > - Suse 10.3 with OFED 1.3.1, OFED 1.2.5.4, and OFED 1.4rc3
> >
> > They all have the same basic problem.  I think one of them reported
> > "Error polling CQ" instead of "poll CQ failed".
> >
> > If I replace the connectX cards with regular DDR cards the problem goes
> > away.
> >
> > I'm getting quite stumped at this point and would appreciate any
> > suggestions or patches.
> >
> > Thanks,
> > Rick
> > --
> > Richard Warner
> > Lead Systems Integrator
> > Microway, Inc
> > (508)732-5517



-- 
Richard Warner
Lead Systems Integrator
Microway, Inc
(508)732-5517
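
When comparing runs across the different driver and OFED combinations listed above, it can help to tally which ranks reported the failure and with which return code. Output from several ranks is sometimes interleaved, as in the Rank 26 / Rank 43 lines quoted above. The following is a minimal sketch, assuming the messages have exactly the form shown in this thread. The names PATTERN, summarize, and SAMPLE are made up for illustration and are not part of mvapich or OFED.

```python
import re
from collections import Counter

# Matches one diagnostic of the form quoted in this thread, e.g.
#   [Rank 44][cm.c: line 860]poll CQ failed -2
# Whitespace (including a line break) is allowed before "poll CQ failed"
# because output from several ranks can be interleaved.
PATTERN = re.compile(
    r"\[Rank (\d+)\]\[([\w.]+): line (\d+)\]\s*poll CQ failed (-?\d+)"
)


def summarize(log_text):
    """Return (sorted ranks, Counter keyed by (file, line, code))."""
    ranks = set()
    sites = Counter()
    for rank, source, line, code in PATTERN.findall(log_text):
        ranks.add(int(rank))
        sites[(source, int(line), int(code))] += 1
    return sorted(ranks), sites


# Sample input copied from the messages quoted earlier in the thread.
SAMPLE = """\
[Rank 44][cm.c: line 860]poll CQ failed -2
[Rank 26][cm.c: line 860]poll CQ failed -2[Rank 43][cm.c: line 860]
poll CQ failed -2
[Rank 94][cm.c: line 860]poll CQ failed -2
"""

if __name__ == "__main__":
    ranks, sites = summarize(SAMPLE)
    print("ranks reporting failure:", ranks)
    for (source, line, code), count in sites.items():
        print(f"{source}:{line} returned {code}: {count} occurrence(s)")
```

Feeding the captured stderr of each run through summarize() shows whether every failing rank hits the same source line and return code, which is the case for all the messages quoted here.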


