[ofa-general] poll CQ failed -2 with connectX

Rick Warner rick at microway.com
Tue Oct 28 07:02:38 PDT 2008


mvapich 1 (0.9.9, 1.0.1, or 1.1.0, depending on the OFED version, etc.)

Thanks,
Rick
On Tuesday 28 October 2008, Pavel Shamis (Pasha) wrote:
> Which MPI implementation do you use ?
>
> Rick Warner wrote:
> > On Monday 27 October 2008, Rick Warner wrote:
> >> Hi all,
> >>
> >> I am configuring an Opteron cluster with ConnectX InfiniBand.  The
> >> problem is that when I run one of the NAS tests, it works the first
> >> and maybe the second time, but after that the jobs fail instantly with
> >> messages like this:
> >>
> >> [Rank 44][cm.c: line 860]poll CQ failed -2
> >> [Rank 51][cm.c: line 860]poll CQ failed -2
> >> [Rank 119][cm.c: line 860]poll CQ failed -2
> >> [Rank 85][cm.c: line 860]poll CQ failed -2
> >> [Rank 0][cm.c: line 860]poll CQ failed -2
> >> [Rank 9][cm.c: line 860]poll CQ failed -2
> >> [Rank 26][cm.c: line 860]poll CQ failed -2
> >> [Rank 43][cm.c: line 860]poll CQ failed -2
> >> [Rank 94][cm.c: line 860]poll CQ failed -2
> >> [Rank 111][cm.c: line 860]poll CQ failed -2
> >>
> >> I can easily reproduce this with only 2 systems using a 16 process LU
> >> job, class B.
> >>
> >> Here are the configs I've tried:
> >> Suse 11 with the distro-provided IB driver and libraries, etc., using
> >> mvapich as provided by Ohio State
> >> Suse 11 with the distro driver, using OFED 1.3.1 libraries and mvapich
> >> Suse 10.3 with OFED 1.3.1, OFED 1.2.5.4, and OFED 1.4rc3
> >>
> >> They all have the same basic problem.  I think one of them reported
> >> "Error polling CQ" instead of "poll CQ failed".
> >>
> >> If I replace the connectX cards with regular DDR cards the problem goes
> >> away.
> >>
> >> I'm getting quite stumped at this point and would appreciate any
> >> suggestions or patches.
> >>
> >> Thanks,
> >> Rick
> >
> > I forgot to mention: on Suse 11 I also tried manually compiled 2.6.26.4
> > and 2.6.27.2 kernels, using the in-kernel drivers.
> >
> > Thanks,
> > Rick



-- 
Richard Warner
Lead Systems Integrator
Microway, Inc
(508)732-5517


