[Users] Communication (Send/Recv) error according to message size(65535)
Kihang Youn
kyoun at lenovo.com
Sun Jan 3 22:08:39 PST 2021
Hello,
I am testing the newly upgraded OFED (5.1-0.6.6) together with the corresponding Open MPI builds (4.0.2 and 4.0.4).
I get a communication error whose cause I have not been able to identify. (The combination of OFED 4.6-1.0.1 and Open MPI 4.0.2 shows no error.)
When communicating between compute nodes (inter-node), the following error occurs if the size of a send/recv message exceeds 65535.
This does not happen when using a single compute node.
If there is anything worth checking, I would appreciate any suggestions, however trivial.
Best Regards,
Kihang
Part of the error message:
[pduru18:351568:0:351568] ib_mlx5_log.c:143 Transport retry count exceeded on mlx5_2:1/RoCE (synd 0x15 vend 0x81 hw_synd 0/0)
[pduru18:351568:0:351568] ib_mlx5_log.c:143 RC QP 0x139d4 wqe[0]: RDMA_READ s-- [rva 0x2b9827e90a40 rkey 0x182ab] [va 0x2b270e05ca00 len 219136 lkey 0x3c2b]
[pduru18:351565:0:351565] ib_mlx5_log.c:143 Transport retry count exceeded on mlx5_2:1/RoCE (synd 0x15 vend 0x81 hw_synd 0/0)
[pduru18:351565:0:351565] ib_mlx5_log.c:143 RC QP 0x139d3 wqe[0]: RDMA_READ s-- [rva 0x2ac9d73be980 rkey 0x8b395] [va 0x2b464c51bc00 len 223232 lkey 0x5e4b]
[pduru18:351571:0:351571] ib_mlx5_log.c:143 Transport retry count exceeded on mlx5_2:1/RoCE (synd 0x15 vend 0x81 hw_synd 0/0)
[pduru18:351571:0:351571] ib_mlx5_log.c:143 RC QP 0x139d2 wqe[0]: RDMA_READ s-- [rva 0x2b0072dd1980 rkey 0x55fea] [va 0x2b70590d8c00 len 223232 lkey 0x715b]
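The failing work requests above are all RDMA_READ operations with lengths well above 65535. As a minimal sketch for sifting through a longer log, the lengths can be extracted and compared against that threshold. The helper name `parse_rdma_reads` and the regular expression are hypothetical, derived only from the three sample lines, and the comparison assumes the logged `len` and the 65535 limit are expressed in the same unit, which has not been verified.

```python
import re

# Regex for the "RC QP ... wqe[N]: <OPCODE> ... len <N>" lines shown above.
# Assumption: the pattern is inferred from the sample lines only and may not
# cover every format that UCX can emit.
WQE_RE = re.compile(
    r"RC QP (?P<qp>0x[0-9a-fA-F]+) wqe\[\d+\]: (?P<op>\w+) .*?\blen (?P<len>\d+)\b"
)

THRESHOLD = 65535  # message size above which the errors were observed


def parse_rdma_reads(lines):
    """Return (qp, opcode, length) tuples for every matching WQE log line."""
    found = []
    for line in lines:
        m = WQE_RE.search(line)
        if m:
            found.append((m.group("qp"), m.group("op"), int(m.group("len"))))
    return found


# The three WQE lines from the error output above, copied verbatim.
SAMPLE = [
    "[pduru18:351568:0:351568] ib_mlx5_log.c:143 RC QP 0x139d4 wqe[0]: RDMA_READ s-- "
    "[rva 0x2b9827e90a40 rkey 0x182ab] [va 0x2b270e05ca00 len 219136 lkey 0x3c2b]",
    "[pduru18:351565:0:351565] ib_mlx5_log.c:143 RC QP 0x139d3 wqe[0]: RDMA_READ s-- "
    "[rva 0x2ac9d73be980 rkey 0x8b395] [va 0x2b464c51bc00 len 223232 lkey 0x5e4b]",
    "[pduru18:351571:0:351571] ib_mlx5_log.c:143 RC QP 0x139d2 wqe[0]: RDMA_READ s-- "
    "[rva 0x2b0072dd1980 rkey 0x55fea] [va 0x2b70590d8c00 len 223232 lkey 0x715b]",
]

if __name__ == "__main__":
    for qp, op, length in parse_rdma_reads(SAMPLE):
        # Last column is True when the length exceeds the observed threshold.
        print(qp, op, length, length > THRESHOLD)
```

Running the same function over a full job log (one string per line) would show whether any failing operation falls at or below the threshold, which would contradict the size-based explanation.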
Backtrace printed by the executable:
==== backtrace (tid: 351569) ====
0 0x000000000004ed85 ucs_debug_print_backtrace() ???:0
1 0x000000000001f9c2 uct_ib_mlx5_completion_with_err() ???:0
2 0x000000000002e736 uct_rc_mlx5_iface_is_reachable() ???:0
3 0x0000000000030481 uct_rc_mlx5_iface_progress() ???:0
4 0x0000000000022f3a ucp_worker_progress() ???:0
5 0x0000000000038574 opal_progress() /export/home/nwp/OFED_TEST/KMALIB/src/openmpi/openmpi-4.0.4/opal/runtime/opal_progress.c:231
6 0x00000000000569f7 ompi_request_wait_completion() /export/home/nwp/OFED_TEST/KMALIB/src/openmpi/openmpi-4.0.4/ompi/../ompi/request/request.h:415
7 0x00000000000569f7 ompi_request_default_wait() /export/home/nwp/OFED_TEST/KMALIB/src/openmpi/openmpi-4.0.4/ompi/request/req_wait.c:42
8 0x0000000000084772 PMPI_Wait() /export/home/nwp/OFED_TEST/KMALIB/src/openmpi/openmpi-4.0.4/ompi/mpi/c/profile/pwait.c:74
9 0x000000000005b26f ompi_wait_f() /export/home/nwp/OFED_TEST/KMALIB/src/openmpi/openmpi-4.0.4/ompi/mpi/fortran/mpif-h/profile/pwait_f.c:76
10 0x00000000005b1642 swap3d_() ???:0
11 0x00000000004a6eb4 hdiff_() ???:0
12 0x000000000046bf81 sciproc_() ???:0
13 0x0000000000462418 MAIN__() ???:0
14 0x000000000040bfde main() ???:0
15 0x00000000000223d5 __libc_start_main() ???:0
16 0x000000000040bee9 _start() ???:0
> ucx_info -v
# UCT version=1.9.0 revision 1d0a420
# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --without-java --enable-cma --without-cuda --without-gdrcopy --with-verbs --without-cm --with-knem --with-rdmacm --without-rocm --without-xpmem --without-ugni
> cat /etc/redhat-release
CentOS Linux release 7.6.1810 (Core)
> uname -a
Linux boot2 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
> ofed_info -s
MLNX_OFED_LINUX-5.1-0.6.6.0:
> ibstat
CA 'mlx5_0'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.28.1002
    Hardware version: 0
    Node GUID: 0xb8599f0300b84da6
    System image GUID: 0xb8599f0300b84da6
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 4
        LMC: 0
        SM lid: 4
        Capability mask: 0x2651e84a
        Port GUID: 0xb8599f0300b84da6
        Link layer: InfiniBand
> ibv_devinfo -v
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 20.28.1002
node_guid: b859:9f03:00b8:4da6
sys_image_guid: b859:9f03:00b8:4da6
vendor_id: 0x02c9
vendor_part_id: 4123
hw_ver: 0x0
board_id: LNV0000000016
phys_port_cnt: 1
max_mr_size: 0xffffffffffffffff
page_size_cap: 0xfffffffffffff000
max_qp: 262144
max_qp_wr: 32768
device_cap_flags: 0xe97e1c36
BAD_PKEY_CNTR
BAD_QKEY_CNTR
AUTO_PATH_MIG
CHANGE_PHY_PORT
PORT_ACTIVE_EVENT
SYS_IMAGE_GUID
RC_RNR_NAK_GEN
MEM_WINDOW
UD_IP_CSUM
XRC
MEM_MGT_EXTENSIONS
MEM_WINDOW_TYPE_2B
MANAGED_FLOW_STEERING
Unknown flags: 0xC8480000
max_sge: 30
max_sge_rd: 30
max_cq: 16777216
max_cqe: 4194303
max_mr: 16777216
max_pd: 16777216
max_qp_rd_atom: 16
max_ee_rd_atom: 0
max_res_rd_atom: 4194304
max_qp_init_rd_atom: 16
max_ee_init_rd_atom: 0
atomic_cap: ATOMIC_HCA (1)
max_ee: 0
max_rdd: 0
max_mw: 16777216
max_raw_ipv6_qp: 0
max_raw_ethy_qp: 0
max_mcast_grp: 2097152
max_mcast_qp_attach: 240
max_total_mcast_qp_attach: 503316480
max_ah: 2147483647
max_fmr: 0
max_srq: 8388608
max_srq_wr: 32767
max_srq_sge: 31
max_pkeys: 128
local_ca_ack_delay: 16
general_odp_caps:
ODP_SUPPORT
ODP_SUPPORT_IMPLICIT
rc_odp_caps:
SUPPORT_SEND
SUPPORT_RECV
SUPPORT_WRITE
SUPPORT_READ
SUPPORT_SRQ
uc_odp_caps:
NO SUPPORT
ud_odp_caps:
SUPPORT_SEND
xrc_odp_caps:
SUPPORT_SEND
SUPPORT_WRITE
SUPPORT_READ
SUPPORT_SRQ
completion timestamp_mask: 0x7fffffffffffffff
hca_core_clock: 156250kHZ
device_cap_flags_ex: 0x30000051E97E1C36
PCI_WRITE_END_PADDING
Unknown flags: 0x3000004100000000
tso_caps:
max_tso: 0
rss_caps:
max_rwq_indirection_tables: 0
max_rwq_indirection_table_size: 0
rx_hash_function: 0x0
rx_hash_fields_mask: 0x0
max_wq_type_rq: 0
packet_pacing_caps:
qp_rate_limit_min: 0kbps
qp_rate_limit_max: 0kbps
max_rndv_hdr_size: 64
max_num_tags: 127
max_ops: 32768
max_sge: 1
flags:
IBV_TM_CAP_RC
cq moderation caps:
max_cq_count: 65535
max_cq_period: 4095 us
maximum available device memory: 262144Bytes
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 4
port_lid: 4
port_lmc: 0x00
link_layer: InfiniBand
max_msg_sz: 0x40000000
port_cap_flags: 0x2251e84a
port_cap_flags2: 0x0032
max_vl_num: 4 (3)
bad_pkey_cntr: 0x0
qkey_viol_cntr: 0x0
sm_sl: 0
pkey_tbl_len: 128
gid_tbl_len: 8
subnet_timeout: 18
init_type_reply: 0
active_width: 2X (16)
active_speed: 50.0 Gbps (64)
phys_state: LINK_UP (5)
GID[ 0]: fe80:0000:0000:0000:b859:9f03:00b8:4da6
> ompi_info
Package: Open MPI root at boot2 Distribution
Open MPI: 4.0.2
Open MPI repo revision: v4.0.2
Open MPI release date: Oct 07, 2019
Open RTE: 4.0.2
Open RTE repo revision: v4.0.2
Open RTE release date: Oct 07, 2019
OPAL: 4.0.2
OPAL repo revision: v4.0.2
OPAL release date: Oct 07, 2019
MPI API: 3.1.0
Ident string: 4.0.2
Prefix: /d1/home/nwp/OFED_TEST/KMALIB/apps/openmpi/4.0.2_ofed
Configured architecture: x86_64-unknown-linux-gnu
Configure host: boot2
Configured by: root
Configured on: Wed Dec 16 06:21:49 UTC 2020
Configure host: boot2
Configure command line: 'CC=icc' 'CFLAGS=-m64' 'FC=ifort' 'FCFLAGS=-m64'
'--prefix=/d1/home/nwp/OFED_TEST/KMALIB/apps/openmpi/4.0.2_ofed'
'--with-platform=mellanox/optimized'
'--with-mxm=/opt/mellanox/mxm'
'--with-knem=/opt/knem-1.1.4.90mlnx1/'
'--with-zlib=/opt/kma/kma_lib/apps/zlib/1.2.11/'
'--with-zlib-libdir=/opt/kma/kma_lib/apps/zlib/1.2.11/lib'
'--with-lsf=/opt/ibm/lsfsuite/dcomp/lsf/10.1'
'--with-lsf-libdir=/opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/lib/'
Built by: root
Built on: Wed Dec 16 06:29:41 UTC 2020
Built host: boot2
C bindings: yes
C++ bindings: no
Fort mpif.h: yes (all)
Fort use mpi: yes (full: ignore TKR)
Fort use mpi size: deprecated-ompi-info-value
Fort use mpi_f08: yes
Fort mpi_f08 compliance: The mpi_f08 module is available, but due to limitations in the ifort compiler and/or Open MPI, does not support the following: array subsections, direct passthru (where possible) to underlying Open MPI's C functionality
Fort mpi_f08 subarrays: no
Java bindings: no
Wrapper compiler rpath: runpath
C compiler: icc
C compiler absolute: /d1/home/nwp/OFED_TEST/KMALIB/apps/intel/20.2_ofed/compilers_and_libraries_2020/linux/bin/intel64/icc
C compiler family name: INTEL
C compiler version: 1910.20200623
C++ compiler: g++
C++ compiler absolute: /opt/kma/kma_lib/apps/gcc/7.5.0/bin/g++
Fort compiler: ifort
Fort compiler abs: /d1/home/nwp/OFED_TEST/KMALIB/apps/intel/20.2_ofed/compilers_and_libraries_2020/linux/bin/intel64/ifort
Fort ignore TKR: yes (!DEC$ ATTRIBUTES NO_ARG_CHECK ::)
Fort 08 assumed shape: yes
Fort optional args: yes
Fort INTERFACE: yes
Fort ISO_FORTRAN_ENV: yes
Fort STORAGE_SIZE: yes
Fort BIND(C) (all): yes
Fort ISO_C_BINDING: yes
Fort SUBROUTINE BIND(C): yes
Fort TYPE,BIND(C): yes
Fort T,BIND(C,name="a"): yes
Fort PRIVATE: yes
Fort PROTECTED: yes
Fort ABSTRACT: yes
Fort ASYNCHRONOUS: yes
Fort PROCEDURE: yes
Fort USE...ONLY: yes
Fort C_FUNLOC: yes
Fort f08 using wrappers: yes
Fort MPI_SIZEOF: yes
C profiling: yes
C++ profiling: no
Fort mpif.h profiling: yes
Fort use mpi profiling: yes
Fort use mpi_f08 prof: yes
C++ exceptions: no
Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)
Sparse Groups: no
Internal debug support: no
MPI interface warnings: yes
MPI parameter check: never
Memory profiling support: no
Memory debugging support: no
dl support: yes
Heterogeneous support: no
mpirun default --prefix: yes
MPI_WTIME support: native
Symbol vis. support: yes
Host topology support: yes
IPv6 support: no
MPI1 compatibility: no
MPI extensions: affinity, cuda, pcollreq
FT Checkpoint support: no (checkpoint thread: no)
C/R Enabled Debugging: no
MPI_MAX_PROCESSOR_NAME: 256
MPI_MAX_ERROR_STRING: 256
MPI_MAX_OBJECT_NAME: 64
MPI_MAX_INFO_KEY: 36
MPI_MAX_INFO_VAL: 256
MPI_MAX_PORT_NAME: 1024
MPI_MAX_DATAREP_STRING: 128
> ofed_info
MLNX_OFED_LINUX-5.1-0.6.6.0 (OFED-5.1-0.6.6):
ar_mgr:
osm_plugins/ar_mgr/ar_mgr-1.0-0.2.MLNX20200630.g8577618.tar.gz
dpcp:
/sw/release/sw_acceleration/dpcp/dpcp-1.0.0-1.src.rpm
dump_pr:
osm_plugins/dump_pr//dump_pr-1.0-0.2.MLNX20200630.g8577618.tar.gz
fabric-collector:
fabric_collector//fabric-collector-1.1.0.MLNX20170103.89bb2aa.tar.gz
hcoll:
mlnx_ofed_hcol/hcoll-4.6.3125-1.src.rpm
ibdump:
https://github.com/Mellanox/ibdump master
commit 6355ebbd664cafb629edeadecd4096ac2a0304c3
ibsim:
mlnx_ofed_ibsim/ibsim-0.9.tar.gz
ibutils2:
ibutils2/ibutils2-2.1.1-0.126.MLNX20200721.gf95236b.tar.gz
iser:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_5_1
commit c72091bb7f69243219dda60946342385c9766aa3
isert:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_5_1
commit c72091bb7f69243219dda60946342385c9766aa3
kernel-mft:
mlnx_ofed_mft/kernel-mft-4.15.0-104.src.rpm
knem:
knem.git mellanox-master
commit 299ba51259c0947b71b762567bccf660513f8643
libpka:
mlnx_ofed_soc/libpka-1.0-1.gcc98895.src.rpm
libvma:
vma/source_rpms/libvma-9.1.1-0.src.rpm
mlnx-dpdk:
https://github.com/Mellanox/dpdk.org mlnx_dpdk_19.11_last_stable
commit c8732df963abf855edf2447a0b8d8543e7924ba9
mlnx-en:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_5_1
commit c72091bb7f69243219dda60946342385c9766aa3
mlnx-ethtool:
mlnx_ofed/ethtool.git mlnx_ofed_5_1
commit a1f6f627af80b76b013b68ff57a3ae41ac7517f9
mlnx-iproute2:
mlnx_ofed/iproute2.git mlnx_ofed_5_1
commit 9a007c2d912ce52ad5e3e9c6a9bc9fb4d20fd52c
mlnx-nfsrdma:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_5_1
commit c72091bb7f69243219dda60946342385c9766aa3
mlnx-nvme:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_5_1
commit c72091bb7f69243219dda60946342385c9766aa3
mlnx-ofa_kernel:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_5_1
commit c72091bb7f69243219dda60946342385c9766aa3
mlxbf-bootctl:
https://github.com/Mellanox/mlxbf-bootctl bluefield-rel/3.0
commit fda69b62ac4f2707a82da18f894b40120f686010
mpi-selector:
ofed-1.5.3-rpms/mpi-selector/mpi-selector-1.0.3-1.src.rpm
mpitests:
mlnx_ofed_mpitest/mpitests-3.2.20-5d20b49.src.rpm
mstflint:
mlnx_ofed_mstflint/mstflint-4.14.0-3.tar.gz
multiperf:
mlnx_ofed_multiperf/multiperf-3.0-0.14.g5f0fd0e.tar.gz
mxm:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.0-0.3.7/SRPMS/mxm-3.7.3112-1.50037.src.rpm
ofed-docs:
docs.git mlnx_ofed-4.0
commit 3d1b0afb7bc190ae5f362223043f76b2b45971cc
openmpi:
mlnx_ofed_ompi_1.8/openmpi-4.0.4rc3-1.src.rpm
opensm:
mlnx_ofed_opensm/opensm-5.7.0.MLNX20200721.7ccc6f6.tar.gz
openvswitch:
openvswitch.git mlnx_ofed_5_1
commit e8a86012636e058cfd48486c39afa8cbac9ed597
perftest:
mlnx_ofed_perftest/perftest-4.4-0.30.g9c50960.tar.gz
rdma-core:
mlnx_ofed/rdma-core.git mlnx_ofed_5_1
commit 77e7f704897a3bf94464d3c12ec508f1e26336fd
rshim:
https://github.com/Mellanox/rshim-user-space master
commit a70d84655d6e248141124bce1805f2c9b0426fe9
sharp:
mlnx_ofed_sharp/sharp-2.2.0.MLNX20200721.2fd570a.tar.gz
sockperf:
sockperf/sockperf-3.7-0.gita1e8e835a689.src.rpm
srp:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_5_1
commit c72091bb7f69243219dda60946342385c9766aa3
ucx:
mlnx_ofed_ucx/ucx-1.9.0-1.src.rpm