[ewg] need HELP with: error polling LP CQ with status RETRY EXCEEDED ERROR status number 12

rdauria at ucla.edu
Tue Jun 16 18:13:05 PDT 2009


Dear All,

I have installed OFED 1.4.1 and separately built Open MPI 1.3.2 on
a mixed-fabric IB cluster (mostly Mellanox cards using the mthca0
device, plus some QLogic cards using the ipath0 device; see further
down in this message). When I run the examples in the Open MPI
distribution or the OSU MPI tests, everything works without errors
or warnings as long as the tests involve only Mellanox nodes, but as
soon as I add a QLogic node to the hostlist I get errors and warnings.

For example, running ring_cxx with 16 processes (8 on a Mellanox node
and 8 on a QLogic node) gives:
___________________________________________________________________________________________
Process 0 sending 10 to 1, tag 201 (16 processes in ring)
Process 0 sent to 1
--------------------------------------------------------------------------
WARNING: The btl_openib_max_inline_data MCA parameter was used to
specify how much inline data should be used, but a device reduced this
value.  This is not an error; it simply means that your run will use
a smaller inline data value than was requested.

   Local host:           n240
   Local device:         ipath0
   Requested value:      128
   Value used by device: 0
--------------------------------------------------------------------------
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 3 exiting
Process 1 exiting
Process 2 exiting
Process 4 exiting
Process 6 exiting
Process 7 exiting
Process 8 exiting
Process 10 exiting
Process 12 exiting
Process 13 exiting
Process 15 exiting
Process 9 exiting
Process 11 exiting
Process 14 exiting
Process 5 exiting
[n200:20836] 3 more processes have sent help message  
help-mpi-btl-openib-cpc-base.txt / inline truncated
[n200:20836] Set MCA parameter "orte_base_help_aggregate" to 0 to see  
all help / error messages
___________________________________________________________________________________________


I think I know how to deal with the WARNING above: set
"btl_openib_max_inline_data = 0" in the
$MPI_HOME/etc/openmpi-mca-params.conf file.
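
For reference, this is the kind of setting I have in mind (the
hostfile name and test binary in the mpirun line are just
placeholders):

    # in $MPI_HOME/etc/openmpi-mca-params.conf
    btl_openib_max_inline_data = 0

    # or, equivalently, per run on the command line:
    mpirun --mca btl_openib_max_inline_data 0 -np 16 \
           -hostfile myhosts ./ring_cxx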

However, when I try to run other parallel applications (which compile
without problems using, for example, the installed mpic++), I again
get no runtime errors if I use only Mellanox nodes, but as soon as I
add a single QLogic node to the hostlist I get the following (here I
had 8 processes on a Mellanox node and 8 on a QLogic one):

___________________________________________________________________________________________
Number of processes = 16
Alltoall data size per process = 128 MB
--------------------------------------------------------------------------
WARNING: The btl_openib_max_inline_data MCA parameter was used to
specify how much inline data should be used, but a device reduced this
value.  This is not an error; it simply means that your run will use
a smaller inline data value than was requested.

   Local host:           n240
   Local device:         ipath0
   Requested value:      128
   Value used by device: 0
--------------------------------------------------------------------------
[[18933,1],15][../../../../../ompi/mca/btl/openib/btl_openib_component.c:2929:handle_wc] from n240 to: n147 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 141655936 opcode 0  vendor error 0 qp_idx  
--------------------------------------------------------------------------
The InfiniBand retry count between two MPI processes has been
exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):

     The total number of times that the sender wishes the receiver to
     retry timeout, packet sequence, etc. errors before posting a
     completion error.

This error typically means that there is something awry within the
InfiniBand fabric itself.  You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.

Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:

* btl_openib_ib_retry_count - The number of times the sender will
   attempt to retry (defaulted to 7, the maximum value).
* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
   to 10).  The actual timeout value used is calculated as:

      4.096 microseconds * (2^btl_openib_ib_timeout)

   See the InfiniBand spec 1.2 (section 12.7.34) for more details.

Below is some information about the host that raised the error and the
peer to which it was connected:

   Local host:   n240
   Local device: ipath0
   Peer host:    n147

You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec has exited due to process rank 15 with PID 479 on
node n240 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------
[i01:19833] 95 more processes have sent help message  
help-mpi-btl-openib-cpc-base.txt / inline truncated
[i01:19833] Set MCA parameter "orte_base_help_aggregate" to 0 to see  
all help / error messages
___________________________________________________________________________________________
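
For completeness, I understand the two MCA parameters mentioned in the
help text above can be set like any other MCA parameter, e.g. (the
values, hostfile and binary below are purely illustrative; I have not
yet verified whether raising the timeout makes any difference):

    # btl_openib_ib_timeout 20 => 4.096 us * 2^20 ~= 4.3 s per retry
    mpirun --mca btl_openib_ib_retry_count 7 \
           --mca btl_openib_ib_timeout 20 \
           -np 16 -hostfile myhosts ./my_app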

Please note that I get the same runtime error even if I run the
application solely on QLogic-to-QLogic nodes, which seems to indicate
that this is not a mixed-fabric problem but rather an issue with the
QLogic nodes themselves.

The majority of the nodes have Mellanox IB cards; this is what I get
from ibv_devinfo on one of them:
___________________________________________________________________________________________
hca_id: mthca0
         fw_ver:                         1.2.0
         node_guid:                      0006:6a00:9800:7a46
         sys_image_guid:                 0006:6a00:9800:7a46
         vendor_id:                      0x02c9
         vendor_part_id:                 25204
         hw_ver:                         0xA0
         board_id:                       MT_0230000001
         phys_port_cnt:                  1
                 port:   1
                         state:                  PORT_ACTIVE (4)
                         max_mtu:                2048 (4)
                         active_mtu:             2048 (4)
                         sm_lid:                 2
                         port_lid:               132
                         port_lmc:               0x00
___________________________________________________________________________________________

A few nodes have QLogic cards, and on each of these nodes ibv_devinfo gives:
___________________________________________________________________________________________
hca_id: ipath0
         fw_ver:                         0.0.0
         node_guid:                      0011:7500:00ff:758c
         sys_image_guid:                 0011:7500:00ff:758c
         vendor_id:                      0x1077
         vendor_part_id:                 29216
         hw_ver:                         0x2
         board_id:                       InfiniPath_QLE7240
         phys_port_cnt:                  1
                 port:   1
                         state:                  PORT_ACTIVE (4)
                         max_mtu:                4096 (5)
                         active_mtu:             2048 (4)
                         sm_lid:                 2
                         port_lid:               328
                         port_lmc:               0x00
___________________________________________________________________________________________

This is what I get from ibstat on a Mellanox node:
___________________________________________________________________________________________
CA 'mthca0'
         CA type: MT25204
         Number of ports: 1
         Firmware version: 1.2.936
         Hardware version: a0
         Node GUID: 0x0002c9020027c650
         System image GUID: 0x0002c9020027c653
         Port 1:
                 State: Active
                 Physical state: LinkUp
                 Rate: 10
                 Base lid: 209
                 LMC: 0
                 SM lid: 2
                 Capability mask: 0x02510a68
                 Port GUID: 0x0002c9020027c651
___________________________________________________________________________________________

and on a QLogic node:
___________________________________________________________________________________________
CA 'ipath0'
         CA type: InfiniPath_QLE7240
         Number of ports: 1
         Firmware version:
         Hardware version: 2
         Node GUID: 0x0011750000ff758c
         System image GUID: 0x0011750000ff758c
         Port 1:
                 State: Active
                 Physical state: LinkUp
                 Rate: 20
                 Base lid: 328
                 LMC: 0
                 SM lid: 2
                 Capability mask: 0x03010800
                 Port GUID: 0x0011750000ff758c
___________________________________________________________________________________________

ofed_info gives:
___________________________________________________________________________________________
OFED-1.4.1
libibverbs:
git://git.openfabrics.org/ofed_1_4/libibverbs.git ofed_1_4
commit b00dc7d2f79e0660ac40160607c9c4937a895433
libmthca:
git://git.kernel.org/pub/scm/libs/infiniband/libmthca.git master
commit be5eef3895eb7864db6395b885a19f770fde7234
libmlx4:
git://git.openfabrics.org/ofed_1_4/libmlx4.git ofed_1_4
commit d5e5026e2bd3bbd7648199a48c4245daf313aa48
libehca:
git://git.openfabrics.org/ofed_1_4/libehca.git ofed_1_4
commit 0249815e9b6f134f33546da6fa2e84e1185eea6d
libipathverbs:
git://git.openfabrics.org/~ralphc/libipathverbs ofed_1_4
commit 337df3c1cbe43c3e9cb58e7f6e91f44603dd23fb
libcxgb3:
git://git.openfabrics.org/~swise/libcxgb3.git ofed_1_4
commit f685c8fe7e77e64614d825e563dd9f02a0b1ae16
libnes:
git://git.openfabrics.org/~glenn/libnes.git master
commit 379cccb4484f39b99c974eb6910d3a0407c0bbd1
libibcm:
git://git.openfabrics.org/~shefty/libibcm.git master
commit 7fb57e005b3eae2feb83b3fd369aeba700a5bcf8
librdmacm:
git://git.openfabrics.org/~shefty/librdmacm.git master
commit 62c2bddeaf5275425e1a7e3add59c3913ccdb4e9
libsdp:
git://git.openfabrics.org/ofed_1_4/libsdp.git ofed_1_4
commit b1eaecb7806d60922b2fe7a2592cea4ae56cc2ab
sdpnetstat:
git://git.openfabrics.org/~amirv/sdpnetstat.git ofed_1_4
commit 798e44f6d5ff8b15b2a86bc36768bd2ad473a6d7
srptools:
git://git.openfabrics.org/~ishai/srptools.git master
commit ce1f64c8dd63c93d56c1cc5fbcdaaadd4f74a1e3
perftest:
git://git.openfabrics.org/~orenmeron/perftest.git master
commit 1cd38e844dc50d670b48200bcda91937df5f5a92
qlvnictools:
git://git.openfabrics.org/~ramachandrak/qlvnictools.git ofed_1_4
commit 4ce9789273896d0e67430c330eb3703405b59951
tvflash:
git://git.openfabrics.org/ofed_1_4/tvflash.git ofed_1_4
commit e1b50b3b8af52b0bc55b2825bb4d6ce699d5c43b
mstflint:
git://git.openfabrics.org/~orenk/mstflint.git master
commit 3352f8997591c6955430b3e68adba33e80a974e3
qperf:
git://git.openfabrics.org/~johann/qperf.git/.git master
commit 18e1c1e8af96cd8bcacced3c4c2a4fd90f880792
ibutils:
git://git.openfabrics.org/~kliteyn/ibutils.git ofed_1_4
commit 9d4bfc3ba19875dfa4583dfaef6f0f579bb013bb
ibsim:
git://git.openfabrics.org/ofed_1_4/ibsim.git ofed_1_4
commit a76132ae36dde8302552d896e35bd29608ac9524

ofa_kernel-1.4.1:
Git:
git://git.openfabrics.org/ofed_1_4/linux-2.6.git ofed_kernel
commit 868661b127c355c64066a796460a7380a722dd84

# MPI
mvapich-1.1.0-3355.src.rpm
mvapich2-1.2p1-1.src.rpm
openmpi-1.3.2-1.src.rpm
mpitests-3.1-891.src.rpm
___________________________________________________________________________________________

Any idea what is wrong here?

Thanks,

Raffaella.


