[ofa-general] poll CQ failed -2 with connectX

Jack Morgenstein jackm at dev.mellanox.co.il
Mon Nov 3 01:39:27 PST 2008


Rick,

Your problem was that a SUSE-packaged OFED driver set (named ofed-kmp-default)
was installed on all your machines (possibly as an automatic part of the
openSUSE install?).

For example, on one of your hosts, I ran
#> rpm -qi ofed-kmp-default
Name        : ofed-kmp-default             Relocations: (not relocatable)
Version     : 1.2.5_2.6.22.18_0.2               Vendor: SUSE LINUX Products GmbH, Nuernberg, Germany
Release     : 18.1                          Build Date: Mon Jun  9 12:42:40 2008
Install Date: Wed Jul 30 18:26:56 2008      Build Host: kalman.suse.de
Group       : System/Base                   Source RPM: ofed-1.2.5-18.1.src.rpm
Size        : 3359904                          License: GPL v2 or later
Signature   : DSA/SHA1, Mon Jun  9 12:47:02 2008, Key ID a84edae89c800aca
Packager    : http://bugs.opensuse.org
URL         : http://www.openfabrics.org
Summary     : Infiniband Kernel Modules

The SUSE-rpm driver set is based on OFED 1.2.5.
This RPM installs the OFED drivers under directory /lib/modules/<kernel version>/updates/drivers.

When you then installed the OFED 1.3.1 and OFED 1.4 drivers, these new drivers were installed under
/lib/modules/<kernel version>/updates/kernel/drivers, but the SUSE drivers were not uninstalled.

Both sets were present on the hosts.
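
One way to spot this kind of overlap is to look for module file names that occur
more than once under the updates tree. The following is only a rough sketch: the
function name find_duplicate_modules is made up for illustration, and it assumes
the modules are plain *.ko files.

```shell
#!/bin/sh
# Hypothetical helper (not part of OFED): print every module file name
# that appears more than once anywhere below the given directory.
find_duplicate_modules() {
    root=$1
    # Strip the directory part, then keep only names seen at least twice.
    find "$root" -type f -name '*.ko' 2>/dev/null \
        | sed 's|.*/||' \
        | sort \
        | uniq -d
}
```

Running find_duplicate_modules "/lib/modules/$(uname -r)/updates" on an affected
host would be expected to list the InfiniBand modules that exist in both driver
sets; an empty result suggests only one set is installed there.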

When you started up the infiniband driver (/etc/init.d/openibd start), the older OFED 1.2.5 driver
was loaded into the kernel.

However, the userspace drivers used were indeed from OFED 1.3.1 and/or OFED 1.4, resulting in a mismatch
between kernel-space and userspace.
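
To see which of the two sets a given module would be loaded from, the path
reported by modinfo can be matched against the two install locations described
above. This is a sketch under assumptions: classify_module_path and its labels
are invented for illustration, and mlx4_core is used below only as an example
module name for ConnectX.

```shell
#!/bin/sh
# Hypothetical helper: label a module path by the directory it lives in.
# The directory-to-package mapping follows the locations described above.
classify_module_path() {
    case $1 in
        */updates/kernel/drivers/*) echo "ofed" ;;      # OFED 1.3.1 / 1.4 install location
        */updates/drivers/*)        echo "suse-kmp" ;;  # ofed-kmp-default install location
        *)                          echo "other" ;;     # e.g. the stock in-kernel driver
    esac
}
```

For example, classify_module_path "$(modinfo -n mlx4_core)" prints "suse-kmp"
when the module path modinfo resolves still points into the SUSE-packaged set.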

Specifically, ConnectX cards support XRC (Extended RC) in OFED 1.3.1 and OFED 1.4 (XRC was not present
in OFED 1.2.5).  The 1.3.1 / 1.4 userspace libraries identified some of the QPs created by the OFED 1.2.5
kernel modules as XRC QPs and returned an error as a result (correctly indicating that these "XRC" QPs
did not exist as XRC QPs).

In any event, uninstalling the SUSE RPMs fixed the problem.

Finally, the OFED installation script now checks for the SUSE-packaged drivers as well, so that if they are
present, they will be uninstalled when installing the OFED-packaged drivers. (This fix will be in
OFED 1.4-rc4, to be released this week.)

- Jack

On Tuesday 28 October 2008 00:38, Rick Warner wrote:
> Hi all,
> 
> I am configuring an opteron cluster with connectX Infiniband.  I have a 
> problem that if I run one of the NAS tests, it works the first, and maybe 2nd 
> time, but after that the jobs instantly fail with messages like this-
> 
> [Rank 44][cm.c: line 860]poll CQ failed -2
> [Rank 51][cm.c: line 860]poll CQ failed -2
> [Rank 119][cm.c: line 860]poll CQ failed -2
> [Rank 85][cm.c: line 860]poll CQ failed -2
> [Rank 0][cm.c: line 860]poll CQ failed -2
> [Rank 9][cm.c: line 860]poll CQ failed -2
> [Rank 26][cm.c: line 860]poll CQ failed -2[Rank 43][cm.c: line 860]
> poll CQ failed -2
> [Rank 94][cm.c: line 860]poll CQ failed -2
> [Rank 111][cm.c: line 860]poll CQ failed -2
> 
> I can easily reproduce this with only 2 systems using a 16 process LU job, 
> class B.
> 
> Here are the configs I've tried-
> Suse 11 with distro provided IB driver and libraries,etc, using mvapich as 
> provided by ohio state
> Suse 11 with distro driver, using OFED 1.3.1 libraries and mvapich
> Suse 10.3 with OFED 1.3.1, OFED 1.2.5.4, and OFED 1.4rc3
> 
> They all have the same basic problem.  I think one of them reported "Error 
> polling CQ" instead of "poll CQ failed".
> 
> If I replace the connectX cards with regular DDR cards the problem goes away.
> 
> I'm getting quite stumped at this point and would appreciate any suggestions 
> or patches.
> 
> Thanks,
> Rick


