[ofa-general] poll CQ failed -2 with connectX

Rick Warner rick at microway.com
Mon Oct 27 15:44:02 PDT 2008


On Monday 27 October 2008, Rick Warner wrote:
> Hi all,
>
> I am configuring an Opteron cluster with ConnectX InfiniBand.  I have a
> problem: if I run one of the NAS tests, it works the first and maybe the
> second time, but after that the jobs fail instantly with messages like this:
>
> [Rank 44][cm.c: line 860]poll CQ failed -2
> [Rank 51][cm.c: line 860]poll CQ failed -2
> [Rank 119][cm.c: line 860]poll CQ failed -2
> [Rank 85][cm.c: line 860]poll CQ failed -2
> [Rank 0][cm.c: line 860]poll CQ failed -2
> [Rank 9][cm.c: line 860]poll CQ failed -2
> [Rank 26][cm.c: line 860]poll CQ failed -2
> [Rank 43][cm.c: line 860]poll CQ failed -2
> [Rank 94][cm.c: line 860]poll CQ failed -2
> [Rank 111][cm.c: line 860]poll CQ failed -2
>
> I can easily reproduce this with only 2 systems using a 16-process LU job,
> class B.
>
> Here are the configs I've tried:
> - Suse 11 with the distro-provided IB driver, libraries, etc., using
>   MVAPICH as provided by Ohio State
> - Suse 11 with the distro driver, using OFED 1.3.1 libraries and MVAPICH
> - Suse 10.3 with OFED 1.3.1, OFED 1.2.5.4, and OFED 1.4rc3
>
> They all have the same basic problem.  I think one of them reported "Error
> polling CQ" instead of "poll CQ failed".
>
> If I replace the ConnectX cards with regular DDR cards, the problem goes
> away.
>
> I'm getting quite stumped at this point and would appreciate any
> suggestions or patches.
>
> Thanks,
> Rick

I forgot to mention: on Suse 11 I also tried manually compiled 2.6.26.4 and
2.6.27.2 kernels, using the in-kernel drivers.

Thanks,
Rick

-- 
Richard Warner
Lead Systems Integrator
Microway, Inc
(508)732-5517


