[libfabric-users] libfabric 1.13 and Intel True Scale Fabric
Hefty, Sean
sean.hefty at intel.com
Tue Oct 11 10:25:45 PDT 2022
I don't know if True Scale is supported anymore. The fact that fi_info isn't listing psm as an option indicates that either there is a problem in the system setup or the provider was removed from the oneAPI toolkits.
Intel MPI ships its own internal version of libfabric. In theory, it should pick the best-performing option for your fabric. You can try forcing a particular provider with "export I_MPI_OFI_PROVIDER=psm", but this won't work until fi_info reports that provider.
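As a quick sanity check, something along these lines might help (the hostnames and application name below are just placeholders for your own):

  # Check whether libfabric can see the psm (True Scale) provider at all
  $ fi_info -p psm

  # If it shows up, force Intel MPI to use it and raise the debug level
  # so the startup log confirms which provider was actually selected
  $ export I_MPI_OFI_PROVIDER=psm
  $ export I_MPI_DEBUG=4
  $ mpirun -n 2 -ppn 1 -hosts node01,node02 ./your_mpi_app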
If you install a custom version of libfabric, you should be able to tell MPI to use that external version (I_MPI_OFI_LIBRARY_INTERNAL=0). This would let you use a newer version of libfabric, though there haven't been updates to the psm provider in a while. But if oneAPI is removing psm from its release, this will give you that option back.
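If you go that route, a rough sketch of the steps is below; the install prefix is illustrative and the configure option should be checked against the build docs for the libfabric version you pick:

  # Build an external libfabric with the psm (True Scale) provider enabled
  $ ./configure --prefix=/opt/libfabric-custom --enable-psm
  $ make && make install

  # Tell Intel MPI to use the external libfabric instead of its bundled copy
  $ export I_MPI_OFI_LIBRARY_INTERNAL=0
  $ export LD_LIBRARY_PATH=/opt/libfabric-custom/lib:$LD_LIBRARY_PATH

  # Confirm the psm provider is now reported before forcing it in MPI
  $ /opt/libfabric-custom/bin/fi_info -p psm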
- Sean
> We have a Linux CentOS 7.9 cluster with an Intel True Scale Fabric Edge
> managed switch (QDR InfiniBand), so our InfiniBand devices are qib0.
>
> $ ibstat
> CA 'qib0'
> CA type: InfiniPath_QLE7340
> Number of ports: 1
> Firmware version:
> Hardware version: 2
> Node GUID: 0x00117500006f7990
> System image GUID: 0x00117500006f7990
>
>
> We installed the Intel oneAPI toolkits 2021 with libfabric 1.13, but we
> are not getting the expected scalability. MPI inter-node communication
> is very slow.
>
>
> When I run fi_info --list I get:
>
> $ fi_info --list
> psm2:
> version: 113.0
> psm3:
> version: 1101.0
> ofi_rxm:
> version: 113.0
> verbs:
> version: 113.0
> tcp:
> version: 113.0
> sockets:
> version: 113.0
> shm:
> version: 113.0
> ofi_hook_noop:
> version: 113.0
>
>
> I know that using the PSM interface is recommended when running MPI
> programs on a cluster with Intel True Scale HCAs, but the MPI startup
> is picking the tcp;ofi_rxm provider by default, and the running MPI
> program is very slow!
>
> [0] MPI startup(): libfabric version: 1.13.0-impi
> [0] MPI startup(): libfabric provider: tcp;ofi_rxm
> [0] MPI startup(): detected tcp;ofi_rxm provider, set device name to
> "tcp-ofi-rxm"
> [0] MPI startup(): max_ch4_vcis: 1, max_reg_eps 64, enable_sep 0,
> enable_shared_ctxs 0, do_av_insert 1
> [0] MPI startup(): addrnamelen: 16
> [0] MPI startup(): File
> "/NFS/opt/intel/oneapi/mpi/2021.4.0/etc/tuning_skx_shm-ofi_tcp-ofi-rxm_40.dat"
> not found
> [0] MPI startup(): Load tuning file:
> "/NFS/opt/intel/oneapi/mpi/2021.4.0/etc/tuning_skx_shm-ofi_tcp-ofi-rxm.dat"
> :
> :
>
>
> Any suggestions to improve MPI inter-node communication?
>
> Any help would be appreciated; I don't have enough experience with
> libfabric.
>
>
> Thanks.
>
> - Julian.
>
> _______________________________________________
> Libfabric-users mailing list
> Libfabric-users at lists.openfabrics.org
> https://lists.openfabrics.org/mailman/listinfo/libfabric-users