[libfabric-users] libfabric 1.13 and Intel True Scale Fabric

judelga judelga at cicese.mx
Tue Oct 11 09:59:41 PDT 2022


Hi,

We have a Linux CentOS 7.9 cluster with an Intel True Scale Fabric Edge 
Managed Switch (QDR InfiniBand), so our InfiniBand devices are qib0:

$ ibstat
CA 'qib0'
     CA type: InfiniPath_QLE7340
     Number of ports: 1
     Firmware version:
     Hardware version: 2
     Node GUID: 0x00117500006f7990
     System image GUID: 0x00117500006f7990


We installed the Intel oneAPI toolkits 2021 with libfabric 1.13, but we are 
not getting the expected scalability: the MPI inter-node communication 
is very slow.


When I run fi_info --list, I get:

$ fi_info --list
psm2:
     version: 113.0
psm3:
     version: 1101.0
ofi_rxm:
     version: 113.0
verbs:
     version: 113.0
tcp:
     version: 113.0
sockets:
     version: 113.0
shm:
     version: 113.0
ofi_hook_noop:
     version: 113.0

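In case it helps, my understanding from the fi_info man page is that I can also dump the full provider/endpoint details rather than just the provider list, for example (the -p filter value below is only an example):

$ fi_info              # list every provider/endpoint combination available on this node
$ fi_info -v -p verbs  # verbose details for a single provider, e.g. verbs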

I know that when running MPI programs on a cluster with Intel True 
Scale HCAs, the use of the PSM interface is recommended, but MPI startup 
is picking the tcp;ofi_rxm provider by default, and the running MPI 
program is very slow:

[0] MPI startup(): libfabric version: 1.13.0-impi
[0] MPI startup(): libfabric provider: tcp;ofi_rxm
[0] MPI startup(): detected tcp;ofi_rxm provider, set device name to 
"tcp-ofi-rxm"
[0] MPI startup(): max_ch4_vcis: 1, max_reg_eps 64, enable_sep 0, 
enable_shared_ctxs 0, do_av_insert 1
[0] MPI startup(): addrnamelen: 16
[0] MPI startup(): File 
"/NFS/opt/intel/oneapi/mpi/2021.4.0/etc/tuning_skx_shm-ofi_tcp-ofi-rxm_40.dat" 
not found
[0] MPI startup(): Load tuning file: 
"/NFS/opt/intel/oneapi/mpi/2021.4.0/etc/tuning_skx_shm-ofi_tcp-ofi-rxm.dat"
:
:

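What I have been thinking of trying is forcing a provider explicitly 
through environment variables before mpirun, roughly as sketched below. 
I am not sure which provider from the fi_info list is actually the right 
one for the qib0 hardware, so the psm3 value and the application name 
here are only placeholders to show the syntax:

$ export FI_PROVIDER=psm3          # libfabric provider filter (placeholder provider name)
$ export I_MPI_OFI_PROVIDER=psm3   # Intel MPI's own OFI provider selection variable
$ export I_MPI_DEBUG=5             # show which provider MPI startup actually picks
$ mpirun -np 80 ./my_mpi_app       # hypothetical application and rank count

I can also set FI_LOG_LEVEL=debug and send the libfabric output if that 
would help.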

Any suggestions to improve MPI inter-node communication?

Any help would be appreciated; I don't have enough experience with 
libfabric.


Thanks.

- Julian.
