[ofiwg] [tac] OFED compatibility issue with Qlogic Infiniband card
todd.rimmer at intel.com
Thu Aug 18 04:42:21 PDT 2016
This error means the combination of timeout/retry configured for the QP has been exceeded. It could mean:
1. The destination does not exist on the IB fabric
2. The destination address was not properly resolved and the wrong IB address is being used
3. The fabric path between this host and the remote host is not stable or has a high symbol error rate
If this message occurs at the start of the job, it is likely #1 or #2. If it occurs later in the job after some traffic has been successfully sent between those nodes, #3 is more likely.
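To make "the combination of timeout/retry" concrete: the QP's local ACK timeout is encoded as an exponent, so each attempt waits roughly 4.096 microseconds * 2^timeout, and the request is retried up to the configured retry count before the completion is reported as RETRY EXCEEDED. The sketch below is back-of-the-envelope arithmetic only. The function name is hypothetical, and the values 18, 20, 22 and 7 are illustrative assumptions, not settings taken from this report.

```python
# Rough estimate of how long a QP keeps retrying before it reports
# RETRY EXCEEDED. This is an approximation, not an exact hardware model.

BASE_US = 4.096  # InfiniBand local ACK timeout unit, in microseconds


def retry_budget_seconds(ib_timeout: int, retry_count: int) -> float:
    """Approximate total wait: 4.096 us * 2**ib_timeout per attempt, times retry_count."""
    per_attempt_us = BASE_US * (2 ** ib_timeout)
    return per_attempt_us * retry_count / 1_000_000


if __name__ == "__main__":
    # Illustrative values (assumptions, not taken from the report in this thread).
    for timeout in (18, 20, 22):
        print(f"ib_timeout={timeout}: ~{retry_budget_seconds(timeout, 7):.1f} s")
```

Each increment of the timeout exponent doubles the budget. If raising it makes no difference, the peer is probably not answering at all or the path is losing packets, which is consistent with the three causes listed above.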
Since you indicate it is affecting performance, I assume it occurs mid-job, so #3 is more likely. In that case, use tools such as ibping and ibdiagnet to analyze the errors in the fabric, or better yet, use the Intel True Scale IB Fabric Suite, which includes a rich set of fabric analysis and diagnostic tools (contact your HW supplier or Intel if you do not have it).
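Before running the fabric tools, it can help to see which node pairs report the error, since failures clustered on one host or link hint at where to look first. What follows is a minimal sketch, assuming the job's output was saved and that error lines follow the format of the one quoted in this thread. The pattern and the function name are hypothetical, and other Open MPI versions may format the message differently.

```python
import re
from collections import Counter

# Pattern modeled on the single openib BTL error line quoted in this thread;
# treat it as an assumption about the log format.
ERROR_RE = re.compile(
    r"from (?P<src>\S+) to: (?P<dst>\S+) error polling \w+ CQ "
    r"with status (?P<status>.+?) status number (?P<code>\d+)"
)


def count_failing_pairs(lines):
    """Count (source, destination) pairs that reported status number 12."""
    pairs = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match and match.group("code") == "12":
            pairs[(match.group("src"), match.group("dst"))] += 1
    return pairs


if __name__ == "__main__":
    sample = [
        "[[23581,1],4][btl_openib_component.c:3369:handle_wc] from n003 to: "
        "192.168.2.5 error polling LP CQ with status RETRY EXCEEDED ERROR "
        "status number 12 for wr_id 7057fe8 opcode 2 vendor error 0 qp_idx 3",
    ]
    for (src, dst), n in count_failing_pairs(sample).most_common():
        print(f"{src} -> {dst}: {n}")
```

If one host appears in most of the failing pairs, its HCA, cable, or switch port is a reasonable first candidate to check with the diagnostic tools.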
Once you resolve this connectivity issue, for the best MPI performance on QLogic InfiniBand cards it is recommended to use Open MPI's PSM MTL rather than the verbs (openib) BTL.
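As a rough illustration of selecting the PSM MTL, the sketch below assembles an mpirun command line using Open MPI's --mca options. It assumes an Open MPI build that includes PSM support. The helper name, process count, hostfile, and program name are hypothetical placeholders.

```python
import shlex


def psm_mpirun_command(num_procs, hostfile, program):
    """Build an mpirun command line that selects the cm PML with the PSM MTL."""
    return [
        "mpirun",
        "--mca", "pml", "cm",    # the cm PML is the one that drives MTL components
        "--mca", "mtl", "psm",   # use the PSM MTL instead of the openib (verbs) BTL
        "-np", str(num_procs),
        "--hostfile", hostfile,
        program,
    ]


if __name__ == "__main__":
    # Hypothetical process count, hostfile, and binary name.
    print(shlex.join(psm_mpirun_command(16, "hosts.txt", "./my_mpi_app")))
```

The script only prints the command so it can be reviewed before being run on the cluster.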
If the problem is #3, work with the distributor you purchased your hardware from to further debug the faulty component.
Voice: 610-312-2152 Fax: 610-312-2233
Todd.Rimmer at intel.com
From: tac [mailto:tac-bounces at lists.openfabrics.org] On Behalf Of Woodruff, Robert J
Sent: Monday, August 15, 2016 12:28 PM
To: nv8840 at rit.edu; interop-wg at lists.openfabrics.org; ofiwg at lists.openfabrics.org; tac at lists.openfabrics.org; ewg at lists.openfabrics.org
Cc: Marciniszyn, Mike
Subject: Re: [tac] [ofiwg] OFED compatibility issue with Qlogic Infiniband card
+ Mike from the Intel InfiniBand driver team.
From: ofiwg [mailto:ofiwg-bounces at lists.openfabrics.org] On Behalf Of nv8840 at rit.edu
Sent: Monday, August 15, 2016 7:50 AM
To: interop-wg at lists.openfabrics.org; ofiwg at lists.openfabrics.org; tac at lists.openfabrics.org; ewg at lists.openfabrics.org
Subject: [ofiwg] OFED compatibility issue with Qlogic Infiniband card
I am trying to work with CentOS 6.8 and a QLogic Corp. IBA6110 InfiniBand HCA (rev 3). While testing, I receive the following error message:
[[23581,1],4][btl_openib_component.c:3369:handle_wc] from n003 to: 192.168.2.5 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 7057fe8 opcode 2 vendor error 0 qp_idx 3
Increasing btl_openib_ib_timeout had no effect.
I am not sure what this means or how it can be resolved, and it is affecting my performance. I am using the InfiniBand support package from CentOS 6.8 to configure the drivers.