[libfabric-users] libfabric 1.13 and Intel True Scale Fabric
judelga
judelga at cicese.mx
Tue Oct 11 09:59:41 PDT 2022
Hi,
We have a Linux CentOS 7.9 cluster with an Intel True Scale Fabric Edge
Managed Switch (QDR InfiniBand), so our InfiniBand devices are qib0:
$ ibstat
CA 'qib0'
CA type: InfiniPath_QLE7340
Number of ports: 1
Firmware version:
Hardware version: 2
Node GUID: 0x00117500006f7990
System image GUID: 0x00117500006f7990
We installed the Intel oneAPI Toolkits 2021 with libfabric 1.13, but we are
not getting the expected scalability: the MPI inter-node communication is
very slow.
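To put a number on how slow it is, I can run a simple point-to-point
benchmark between two of the nodes (node01/node02 are just placeholders
here), assuming the Intel MPI Benchmarks shipped with oneAPI are an
appropriate tool for this:

$ mpirun -n 2 -ppn 1 -hosts node01,node02 IMB-MPI1 PingPong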
When I run fi_info --list I get:
$ fi_info --list
psm2:
version: 113.0
psm3:
version: 1101.0
ofi_rxm:
version: 113.0
verbs:
version: 113.0
tcp:
version: 113.0
sockets:
version: 113.0
shm:
version: 113.0
ofi_hook_noop:
version: 113.0
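In case it is useful, I can also ask fi_info about a single provider to see
whether it actually finds the qib0 HCA (assuming I am using the fi_info
options correctly; as far as I understand, FI_LOG_LEVEL makes libfabric
print why a provider does or does not initialize):

$ fi_info -p verbs -v
$ FI_LOG_LEVEL=info fi_info -p psm2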
I know that when running MPI programs on a cluster with Intel True Scale
HCAs the use of the PSM interface is recommended, but the MPI startup is
picking the tcp;ofi_rxm provider by default, and the running MPI program is
very slow!
[0] MPI startup(): libfabric version: 1.13.0-impi
[0] MPI startup(): libfabric provider: tcp;ofi_rxm
[0] MPI startup(): detected tcp;ofi_rxm provider, set device name to
"tcp-ofi-rxm"
[0] MPI startup(): max_ch4_vcis: 1, max_reg_eps 64, enable_sep 0,
enable_shared_ctxs 0, do_av_insert 1
[0] MPI startup(): addrnamelen: 16
[0] MPI startup(): File
"/NFS/opt/intel/oneapi/mpi/2021.4.0/etc/tuning_skx_shm-ofi_tcp-ofi-rxm_40.dat"
not found
[0] MPI startup(): Load tuning file:
"/NFS/opt/intel/oneapi/mpi/2021.4.0/etc/tuning_skx_shm-ofi_tcp-ofi-rxm.dat"
:
:
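My first idea was to force the provider through environment variables along
these lines, but I am not sure these are the right settings for a True Scale
fabric (I am not even sure psm2 or psm3 support the qib0 hardware):

$ export I_MPI_FABRICS=shm:ofi
$ export FI_PROVIDER=psm2             # or psm3 / verbs, from the fi_info list above
$ export I_MPI_DEBUG=5                # to see which provider MPI startup selects
$ mpirun -n 80 -ppn 40 ./my_mpi_app   # my_mpi_app and the rank counts are placeholders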
Any suggestions to improve the MPI inter-node communication?
Any help would be appreciated; I don't have enough experience with
libfabric.
Thanks.
- Julian.