[ofa-general] poll CQ failed -2 with connectX
Pavel Shamis (Pasha)
pasha at dev.mellanox.co.il
Tue Oct 28 06:25:16 PDT 2008
Which MPI implementation do you use?
Rick Warner wrote:
> On Monday 27 October 2008, Rick Warner wrote:
>
>> Hi all,
>>
>> I am configuring an Opteron cluster with ConnectX InfiniBand. When I run
>> one of the NAS tests, it works the first and maybe the second time, but
>> after that the jobs fail instantly with messages like this:
>>
>> [Rank 44][cm.c: line 860]poll CQ failed -2
>> [Rank 51][cm.c: line 860]poll CQ failed -2
>> [Rank 119][cm.c: line 860]poll CQ failed -2
>> [Rank 85][cm.c: line 860]poll CQ failed -2
>> [Rank 0][cm.c: line 860]poll CQ failed -2
>> [Rank 9][cm.c: line 860]poll CQ failed -2
>> [Rank 26][cm.c: line 860]poll CQ failed -2
>> [Rank 43][cm.c: line 860]poll CQ failed -2
>> [Rank 94][cm.c: line 860]poll CQ failed -2
>> [Rank 111][cm.c: line 860]poll CQ failed -2
>>
>> I can easily reproduce this with only 2 systems using a 16 process LU job,
>> class B.
>>
>> Here are the configs I've tried:
>> - Suse 11 with the distro-provided IB driver, libraries, etc., using
>>   mvapich as provided by Ohio State
>> - Suse 11 with the distro driver, using OFED 1.3.1 libraries and mvapich
>> - Suse 10.3 with OFED 1.3.1, OFED 1.2.5.4, and OFED 1.4rc3
>>
>> They all have the same basic problem. I think one of them reported "Error
>> polling CQ" instead of "poll CQ failed".
>>
>> If I replace the connectX cards with regular DDR cards the problem goes
>> away.
>>
>> I'm getting quite stumped at this point and would appreciate any
>> suggestions or patches.
>>
>> Thanks,
>> Rick
>>
>
> I forgot to mention: on Suse 11 I also tried manually compiled 2.6.26.4 and
> 2.6.27.2 kernels, using the in-kernel drivers.
>
> Thanks,
> Rick
>
>
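When comparing runs across the driver and library combinations listed above, it can help to see exactly which ranks reported the failure. The following is only a minimal sketch, assuming the job's output has been captured as plain text. The function name `failed_ranks` and the sample text are illustrative and are not part of mvapich or OFED. Because many ranks write at once, messages may run together on one line or wrap onto the next, so the pattern scans the whole text rather than matching line by line.

```python
import re
from collections import Counter

# Matches messages such as "[Rank 44][cm.c: line 860]poll CQ failed -2".
# \s* between the location and the message text tolerates a wrapped line.
PATTERN = re.compile(
    r"\[Rank (\d+)\]\[([^\]:]+): line (\d+)\]\s*poll CQ failed (-?\d+)"
)

def failed_ranks(output):
    """Return (rank, source_file, line, code) tuples found in job output."""
    return [
        (int(rank), src, int(line), int(code))
        for rank, src, line, code in PATTERN.findall(output)
    ]

# Illustrative sample, including two messages that ran together and wrapped.
sample = """\
[Rank 44][cm.c: line 860]poll CQ failed -2
[Rank 26][cm.c: line 860]poll CQ failed -2[Rank 43][cm.c: line 860]
poll CQ failed -2
[Rank 94][cm.c: line 860]poll CQ failed -2
"""

hits = failed_ranks(sample)
ranks = sorted(rank for rank, _, _, _ in hits)
codes = Counter(code for _, _, _, code in hits)
print(ranks)
print(dict(codes))
```

Mapping the rank numbers back to hosts (using whatever host file the job was launched with) would then show whether the failures cluster on particular nodes or are spread across both.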