[ofa-general] rdma_read scale-up (8 ppn) issue with iMPI IMB alltoallv, MT25208 SDR
Arlin Davis
ardavis at ichips.intel.com
Fri Sep 28 13:42:01 PDT 2007
We are running into IBV_WC_RETRY_EXC_ERR errors on large rdma_reads
when running IMB alltoallv under iMPI. The failure always occurs between
processes on the same node, so it may be a loopback issue.
Has anyone else run into rdma_read issues like this?
Here are the details:
Two Clovertown X5355 servers (8 cores each), RHEL4u4, iMPI 3.0.
retry_count is set to 7.
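For context on what retry_count = 7 means in time, here is a rough sketch of how long a requester waits before it reports IBV_WC_RETRY_EXC_ERR. It uses the nominal InfiniBand local ACK timeout of 4.096 usec * 2^timeout. The timeout exponent is not given in this report, so the value 14 in the usage line is purely hypothetical, and the function names are made up for illustration. Treat the result as a nominal lower bound; the HCA may wait longer than the nominal interval.

```python
def nominal_ack_timeout_s(timeout_exp: int) -> float:
    """Nominal local ACK timeout in seconds for a 5-bit QP timeout exponent."""
    if not 0 <= timeout_exp <= 31:
        raise ValueError("timeout exponent must be in 0..31")
    if timeout_exp == 0:
        # A timeout value of 0 means the requester waits indefinitely.
        return float("inf")
    return 4.096e-6 * (2 ** timeout_exp)


def min_time_to_retry_exceeded_s(timeout_exp: int, retry_count: int) -> float:
    """Nominal time until the retry budget is exhausted.

    The request is sent once and then retried retry_count times, and each
    attempt waits one ACK timeout before it is considered lost.
    """
    if not 0 <= retry_count <= 7:
        raise ValueError("retry_count must be in 0..7")
    return (retry_count + 1) * nominal_ack_timeout_s(timeout_exp)


if __name__ == "__main__":
    # timeout_exp=14 is an assumed example value, not taken from this setup.
    print(min_time_to_retry_exceeded_s(timeout_exp=14, retry_count=7))
```

With retry_count already at its maximum of 7, the only remaining knob in this formula is the timeout exponent.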
[ardavis@compute-0-14 src]$ ibv_devinfo
hca_id: mthca0
        fw_ver:                 4.8.200
        node_guid:              0002:c902:0000:4fa8
        sys_image_guid:         0002:c902:0000:4fa8
        vendor_id:              0x02c9
        vendor_part_id:         25208
        hw_ver:                 0xA0
        board_id:               MT_00A0000001
        phys_port_cnt:          2
[ardavis@compute-0-14 src]$ mpiexec -perhost 8 -n 8 -env DAPL_DBG_TYPE 0x83 -env I_MPI_DEBUG 0 -env I_MPI_DEVICE rdma ./IMB-MPI1 alltoallv -npmin 16
#---------------------------------------------------
# Intel (R) MPI Benchmark Suite V3.0, MPI-1 part
#---------------------------------------------------
# Date : Fri Sep 28 12:26:05 2007
# Machine : x86_64
# System : Linux
# Release : 2.6.9-42.ELsmp
# Version : #1 SMP Wed Jul 12 23:32:02 EDT 2006
# MPI Version : 2.0
# MPI Thread Environment: MPI_THREAD_SINGLE
#
# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# Alltoallv
#----------------------------------------------------------------
# Benchmarking Alltoallv
# #processes = 8
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.66         0.70         0.68
            1         1000       483.97       484.49       484.32
            2         1000       483.35       483.41       483.37
            4         1000       484.29       484.41       484.39
            8         1000       483.86       484.01       483.97
           16         1000       479.72       479.87       479.82
           32         1000       483.95       484.07       484.00
           64         1000       482.13       482.27       482.22
          128         1000       485.00       485.13       485.09
          256         1000       485.93       486.06       486.00
          512         1000       487.68       487.78       487.72
         1024         1000       487.82       487.98       487.94
         2048         1000       497.09       497.27       497.21
         4096         1000       510.79       510.95       510.86
         8192         1000       506.51       506.64       506.59
        16384         1000       642.15       642.26       642.21
        32768         1000      1816.55      1816.80      1816.67
        65536          640      2926.42      2926.65      2926.51
       131072          320      5214.20      5215.18      5214.64
       262144          160     10018.31     10021.30     10020.22
       524288           80     19554.79     19581.09     19573.01
      1048576           40     43291.05     43342.45     43323.24
      2097152           20    109898.01    110455.85    110361.47
DTO completion ERROR: 12: op 0x2
DTO completion ERROR: 12: op 0x2 (ep disconnected)
[0][rdma_iba.c:193] Intel MPI fatal error: DTO operation completed with error. status=0x1. cookie=0x0
DTO completion ERROR: 5: op 0x2
[7][rdma_iba.c:193] Intel MPI fatal error: DTO operation completed with error. status=0x8. cookie=0x4
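To make the numeric codes in the "DTO completion ERROR" lines easier to read, here is a small sketch that maps them to names. It assumes the number after "ERROR:" is the raw ibv_wc_status value from libibverbs, which is consistent with 12 being IBV_WC_RETRY_EXC_ERR as reported above. The table copies the first enum values from <infiniband/verbs.h>; the function name decode_dto_errors is hypothetical. The "status=0x..." fields in the Intel MPI fatal error lines belong to a different layer and are deliberately not decoded here.

```python
import re

# First values of enum ibv_wc_status from <infiniband/verbs.h>.
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    1: "IBV_WC_LOC_LEN_ERR",
    2: "IBV_WC_LOC_QP_OP_ERR",
    3: "IBV_WC_LOC_EEC_OP_ERR",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
    6: "IBV_WC_MW_BIND_ERR",
    7: "IBV_WC_BAD_RESP_ERR",
    8: "IBV_WC_LOC_ACCESS_ERR",
    9: "IBV_WC_REM_INV_REQ_ERR",
    10: "IBV_WC_REM_ACCESS_ERR",
    11: "IBV_WC_REM_OP_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}

# Matches lines such as "DTO completion ERROR: 12: op 0x2".
DTO_ERROR_RE = re.compile(r"DTO completion ERROR:\s*(\d+):\s*op\s*(0x[0-9a-fA-F]+)")


def decode_dto_errors(log_text):
    """Return (status, status_name, op) for each DTO completion error line."""
    results = []
    for match in DTO_ERROR_RE.finditer(log_text):
        status = int(match.group(1))
        op = int(match.group(2), 16)
        name = IBV_WC_STATUS.get(status, "UNKNOWN")
        results.append((status, name, op))
    return results


if __name__ == "__main__":
    sample = (
        "DTO completion ERROR: 12: op 0x2\n"
        "DTO completion ERROR: 12: op 0x2 (ep disconnected)\n"
        "DTO completion ERROR: 5: op 0x2\n"
    )
    for status, name, op in decode_dto_errors(sample):
        print(status, name, hex(op))
```

Read this way, the first two errors are retry-exceeded completions, and the third (status 5) is IBV_WC_WR_FLUSH_ERR, which is what outstanding work requests report once their queue pair has already entered the error state.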
Thanks,
-arlin