[ewg] IB errors with openMPI

Pradeep Satyanarayana pradeeps at linux.vnet.ibm.com
Sun Feb 21 21:46:53 PST 2010


We are trying to run openMPI with OFED-1.5 on the 2.6.31-rt11-preempt-rt kernel and see the following errors:

[[45393,1],8][../../../../../ompi/mca/btl/openib/btl_openib_component.c:2951:handle_wc]
from elm3b107 to: elm3b17 error polling HP CQ with status WORK REQUEST FLUSHED
ERROR status number 5 for wr_id 1289846528 opcode -1782678528  vendor error 244
qp_idx 0
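
For reference, status number 5 in the log above matches IBV_WC_WR_FLUSH_ERR in libibverbs' enum ibv_wc_status. As far as I understand, a flush error is reported for work requests that were still queued when the QP moved to the error state, so an earlier non-flush completion error, if one was logged, would probably be more telling. Below is a small sketch for decoding the number. The helper name wc_status_name is made up for illustration, and only the first 14 enum values are listed.

```python
# Completion status names in the order they appear in libibverbs'
# enum ibv_wc_status (values 0..13; the enum continues beyond these).
IBV_WC_STATUS = [
    "IBV_WC_SUCCESS",            # 0
    "IBV_WC_LOC_LEN_ERR",        # 1
    "IBV_WC_LOC_QP_OP_ERR",      # 2
    "IBV_WC_LOC_EEC_OP_ERR",     # 3
    "IBV_WC_LOC_PROT_ERR",       # 4
    "IBV_WC_WR_FLUSH_ERR",       # 5
    "IBV_WC_MW_BIND_ERR",        # 6
    "IBV_WC_BAD_RESP_ERR",       # 7
    "IBV_WC_LOC_ACCESS_ERR",     # 8
    "IBV_WC_REM_INV_REQ_ERR",    # 9
    "IBV_WC_REM_ACCESS_ERR",     # 10
    "IBV_WC_REM_OP_ERR",         # 11
    "IBV_WC_RETRY_EXC_ERR",      # 12
    "IBV_WC_RNR_RETRY_EXC_ERR",  # 13
]


def wc_status_name(code):
    """Hypothetical helper: map a numeric completion status to its enum name."""
    if 0 <= code < len(IBV_WC_STATUS):
        return IBV_WC_STATUS[code]
    return "unknown (%d)" % code


# The status number reported in the openib BTL message above.
print(wc_status_name(5))
```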

At this point I looked at the mlx4 diag counters and saw some non-zero values. Because the counters were
read only after a series of runs, we don't know which run caused them to increase from 0. Do these counters
have any correlation to the above MPI error?

[root@elm3b17 diag_counters]# pwd
/sys/class/infiniband/mlx4_0/diag_counters
[root@elm3b17 diag_counters]#

[root@elm3b17 diag_counters]# cat rq_num_rnr
19
[root@elm3b17 diag_counters]# cat rq_num_wrfe
2009
[root@elm3b17 diag_counters]# cat sq_num_tree
12
[root@elm3b17 diag_counters]# cat sq_num_wrfe
12
[root@elm3b17 diag_counters]#

Similarly, here are the counters on elm3b107:

[root@elm3b107 diag_counters]# cat rq_num_wrfe
5156
[root@elm3b107 diag_counters]# cat sq_num_rnr
18
[root@elm3b107 diag_counters]# cat sq_num_tree
20
[root@elm3b107 diag_counters]# cat sq_num_wrfe
20
[root@elm3b107 diag_counters]#
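
Since we don't know when these counters moved, one option would be to snapshot the whole diag_counters directory before and after each individual run and look at the difference. A rough sketch follows, assuming every file in that directory holds a single integer. The function names read_counters and diff_counters are made up for illustration, and the path is the one from the session above.

```python
import os

# Path taken from the session above; the device name may differ per host.
DIAG_DIR = "/sys/class/infiniband/mlx4_0/diag_counters"


def read_counters(diag_dir=DIAG_DIR):
    """Return {counter_name: value} for each file in diag_dir holding an integer."""
    counters = {}
    for name in sorted(os.listdir(diag_dir)):
        path = os.path.join(diag_dir, name)
        if not os.path.isfile(path):
            continue
        try:
            with open(path) as f:
                counters[name] = int(f.read().strip())
        except (OSError, ValueError):
            # Skip entries that are unreadable or not a plain integer.
            continue
    return counters


def diff_counters(before, after):
    """Return {counter_name: delta} for counters whose value changed."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] != before.get(name, 0)
    }


# Usage sketch (left as comments so the functions can be loaded on any host):
#   before = read_counters()
#   ... run a single MPI job ...
#   print(diff_counters(before, read_counters()))
```

Running this on both nodes around each job should at least show which run the sq_num_tree and *_wrfe increments belong to.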


We are using ConnectX dual-port DDR HCAs (FW version 2.6). What does the vendor error 244 mean? Any suggestions
on how to debug this further?

Thanks
Pradeep



