[libfabric-users] libfabric 1.13 and Intel True Scale Fabric
judelga
judelga at cicese.mx
Wed Oct 12 14:21:27 PDT 2022
Thanks Sean.
I already had success executing an MPI program with the PSM provider after
installing libfabric version 1.8.1, but I cannot run it through a Torque
PBS job script, only with mpiexec at the command prompt.
I have Linux CentOS 7.9 running Torque PBS version 4.2.1.
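For context, this is roughly the kind of job script I am submitting (a minimal sketch: the queue name, resource request, and program name are placeholders):

#!/bin/bash
#PBS -N psm_test
#PBS -q batch                       # placeholder queue name
#PBS -l nodes=2:ppn=4
#PBS -l walltime=00:10:00
#PBS -j oe

cd $PBS_O_WORKDIR

# load the same MPI environment I use interactively (path is a placeholder)
source /NFS/opt/intel/oneapi/setvars.sh

# force the psm provider, as in the interactive runs, and turn on debug output
export FI_PROVIDER=psm
export I_MPI_OFI_PROVIDER=psm
export FI_LOG_LEVEL=info
export I_MPI_DEBUG=5

# Intel MPI's hydra normally detects the Torque node file; otherwise pass -f $PBS_NODEFILE
mpiexec -n 8 ./my_mpi_program       # placeholder program name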
This is the error message I get when the job runs:
libfabric:31383:core:core:fi_fabric_():1084<info> Opened fabric: psm
node1.31387Driver initialization failure on /dev/ipath (err=23)
node1.31383Driver initialization failure on /dev/ipath (err=23)
node1.31385Driver initialization failure on /dev/ipath (err=23)
Abort(1091215) on node 4 (rank 4 in comm 0): Fatal error in PMPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(138)........:
MPID_Init(1183)..............:
MPIDI_OFI_mpi_init_hook(1716): OFI fi_open domain failed
(ofi_init.c:1716:MPIDI_OFI_mpi_init_hook:Input/output error)
On node1:
ls -la /dev/ipath*
crw-rw-rw- 1 root root 240, 0 Oct 3 09:50 /dev/ipath
crw-rw-rw- 1 root root 240, 1 Oct 3 09:50 /dev/ipath0
crw-rw-rw- 1 root root 240, 129 Oct 3 09:50 /dev/ipath_diag0
crw-rw-rw- 1 root root 240, 128 Oct 3 09:50 /dev/ipath_diagpkt
ls -la /dev/infiniband
total 0
drwxr-xr-x 2 root root 120 Oct 3 09:50 .
drwxr-xr-x 21 root root 3400 Oct 3 10:24 ..
crw-rw-rw- 1 root root 231, 64 Oct 3 09:50 issm0
crw-rw-rw- 1 root root 10, 58 Oct 3 09:50 rdma_cm
crw-rw-rw- 1 root root 231, 0 Oct 3 09:50 umad0
crw-rw-rw- 1 root root 231, 192 Oct 3 09:50 uverbs0
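Since everything works when I run mpiexec interactively, I was thinking of adding a few checks like these to the job script, to compare the batch environment with my login shell (just a sketch):

# run from inside a Torque job, on the compute node
hostname
id                            # which user/groups the job runs as
ulimit -a                     # batch jobs can get different limits than a login shell
env | grep -E 'I_MPI|FI_'     # do the provider variables survive qsub?
ls -l /dev/ipath*             # same device permissions as in the interactive shell?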
Any suggestions?
Thanks
Regards.
Julian.
On 11/10/22 10:25, Hefty, Sean wrote:
> I don't know if True Scale is supported anymore. The fact that fi_info isn't showing psm as an option indicates either a problem in the system setup or that the provider was removed from the oneAPI toolkits.
>
> Intel MPI has its own internal version of libfabric. In theory, it should pick the best performing option for your fabric. You can try forcing the use of a provider using "export I_MPI_OFI_PROVIDER=psm". This won't work until you can get fi_info to report that provider.
>
> If you install a custom version of libfabric, you should be able to tell MPI to use that external version (I_MPI_OFI_LIBRARY_INTERNAL=0). This would let you use a newer version of libfabric, though there haven't been updates to the psm provider in a while. But if oneAPI is removing psm from its release, this will give you that option back.
>
> - Sean
>
>
>> We have a Linux CentOS 7.9 cluster with an Intel True Scale Fabric Edge
>> Managed Switch (QDR InfiniBand), so our InfiniBand devices are qib0:
>>
>> $ ibstat
>> CA 'qib0'
>> CA type: InfiniPath_QLE7340
>> Number of ports: 1
>> Firmware version:
>> Hardware version: 2
>> Node GUID: 0x00117500006f7990
>> System image GUID: 0x00117500006f7990
>>
>>
>> We installed the Intel oneAPI Toolkits 2021 with libfabric 1.13, but we
>> are not getting the expected scalability: MPI inter-node communication is
>> very slow.
>>
>>
>> When I run fi_info --list I get:
>>
>> $ fi_info --list
>> psm2:
>> version: 113.0
>> psm3:
>> version: 1101.0
>> ofi_rxm:
>> version: 113.0
>> verbs:
>> version: 113.0
>> tcp:
>> version: 113.0
>> sockets:
>> version: 113.0
>> shm:
>> version: 113.0
>> ofi_hook_noop:
>> version: 113.0
>>
>>
>> I know that the PSM interface is recommended when running MPI programs on
>> a cluster with Intel True Scale HCAs, but MPI startup is picking the
>> tcp;ofi_rxm provider by default and the MPI program runs very slowly:
>>
>> [0] MPI startup(): libfabric version: 1.13.0-impi
>> [0] MPI startup(): libfabric provider: tcp;ofi_rxm
>> [0] MPI startup(): detected tcp;ofi_rxm provider, set device name to
>> "tcp-ofi-rxm"
>> [0] MPI startup(): max_ch4_vcis: 1, max_reg_eps 64, enable_sep 0,
>> enable_shared_ctxs 0, do_av_insert 1
>> [0] MPI startup(): addrnamelen: 16
>> [0] MPI startup(): File
>> "/NFS/opt/intel/oneapi/mpi/2021.4.0/etc/tuning_skx_shm-ofi_tcp-ofi-rxm_40.dat"
>> not found
>> [0] MPI startup(): Load tuning file:
>> "/NFS/opt/intel/oneapi/mpi/2021.4.0/etc/tuning_skx_shm-ofi_tcp-ofi-rxm.dat"
>> :
>> :
>>
>>
>> Any suggestions to improve MPI inter-node communication?
>>
>> Any help would be appreciated; I don't have much experience with
>> libfabric.
>>
>>
>> Thanks.
>>
>> - Julian.
>>
>> _______________________________________________
>> Libfabric-users mailing list
>> Libfabric-users at lists.openfabrics.org
>> https://lists.openfabrics.org/mailman/listinfo/libfabric-users
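P.S. For when I try the external libfabric that Sean suggested, my understanding is that the setup is roughly this (the install prefix and program name are placeholders):

export I_MPI_OFI_LIBRARY_INTERNAL=0                         # use an external libfabric instead of the one bundled with Intel MPI
export LD_LIBRARY_PATH=/opt/libfabric/lib:$LD_LIBRARY_PATH  # placeholder install prefix
export FI_PROVIDER=psm                                      # ask libfabric for the psm provider

fi_info -p psm                  # confirm the external library reports psm before running MPI
mpirun -n 8 ./my_mpi_program    # placeholder program name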