[libfabric-users] libfabric 1.13 and Intel True Scale Fabric

judelga judelga at cicese.mx
Wed Oct 12 14:21:27 PDT 2022


Thanks Sean.

I already had success executing an MPI program with the PSM provider. I
installed libfabric version 1.8.1, but I can't run it through a Torque
PBS job script; it only works when I use mpiexec at the command prompt.

I have Linux CentOS 7.9 running Torque PBS version 4.2.1.
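
In case it is relevant, this is roughly the job script I am testing with
(a sketch; the node/ppn counts, the libfabric install path and
./my_mpi_program are placeholders). My assumption is that the batch
environment may not carry the same settings as my interactive shell, so
I export them explicitly and print a couple of sanity checks:

    #!/bin/bash
    #PBS -N psm_test
    #PBS -l nodes=2:ppn=4
    #PBS -j oe
    cd $PBS_O_WORKDIR

    # make the batch environment match the interactive one
    export FI_PROVIDER=psm                  # ask libfabric for the PSM provider
    export I_MPI_OFI_LIBRARY_INTERNAL=0     # use the external libfabric 1.8.1
    export LD_LIBRARY_PATH=/path/to/libfabric-1.8.1/lib:$LD_LIBRARY_PATH

    # sanity checks as seen by pbs_mom on the first node
    ulimit -l                               # locked-memory limit in the batch shell
    fi_info -p psm | head                   # is the psm provider visible here?

    mpiexec -n 8 ./my_mpi_program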

This is my error message:

libfabric:31383:core:core:fi_fabric_():1084<info> Opened fabric: psm
node1.31387Driver initialization failure on /dev/ipath (err=23)
node1.31383Driver initialization failure on /dev/ipath (err=23)
node1.31385Driver initialization failure on /dev/ipath (err=23)
Abort(1091215) on node 4 (rank 4 in comm 0): Fatal error in PMPI_Init: 
Other MPI error, error stack:
MPIR_Init_thread(138)........:
MPID_Init(1183)..............:
MPIDI_OFI_mpi_init_hook(1716): OFI fi_open domain failed 
(ofi_init.c:1716:MPIDI_OFI_mpi_init_hook:Input/output error)


node 1:

ls -la /dev/ipath*
crw-rw-rw- 1 root root 240,   0 Oct  3 09:50 /dev/ipath
crw-rw-rw- 1 root root 240,   1 Oct  3 09:50 /dev/ipath0
crw-rw-rw- 1 root root 240, 129 Oct  3 09:50 /dev/ipath_diag0
crw-rw-rw- 1 root root 240, 128 Oct  3 09:50 /dev/ipath_diagpkt

  ls -la /dev/infiniband
total 0
drwxr-xr-x  2 root root      120 Oct  3 09:50 .
drwxr-xr-x 21 root root     3400 Oct  3 10:24 ..
crw-rw-rw-  1 root root 231,  64 Oct  3 09:50 issm0
crw-rw-rw-  1 root root  10,  58 Oct  3 09:50 rdma_cm
crw-rw-rw-  1 root root 231,   0 Oct  3 09:50 umad0
crw-rw-rw-  1 root root 231, 192 Oct  3 09:50 uverbs0


Any suggestions?

Thanks

Regards.

Julian.

On 11/10/22 10:25, Hefty, Sean wrote:
> I don't know if True Scale is supported anymore.  The fact that fi_info isn't showing psm as an option indicates that either there's a problem in the system setup or the provider was removed from the oneAPI toolkits.
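>
> A quick way to check whether the provider can even be loaded, independent of MPI (standard libfabric tooling; FI_LOG_LEVEL=info just turns on verbose core/provider logging):
>
>     FI_LOG_LEVEL=info fi_info -p psm
>
> If psm does not show up in that output, Intel MPI will not be able to select it either.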
>
> Intel MPI has its own internal version of libfabric.  In theory, it should pick the best performing option for your fabric.  You can try forcing the use of a provider using "export I_MPI_OFI_PROVIDER=psm".  This won't work until you can get fi_info to report that provider.
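>
> To confirm which provider was actually picked at run time, the I_MPI_DEBUG startup lines (the same ones quoted further down) are the easiest check. A minimal example, with ./a.out standing in for any MPI binary:
>
>     I_MPI_OFI_PROVIDER=psm I_MPI_DEBUG=5 mpirun -n 2 ./a.out 2>&1 | grep "libfabric provider"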
>
> If you install a custom version of libfabric, you should be able to tell MPI to use that external version (I_MPI_OFI_LIBRARY_INTERNAL=0).  This would let you use a newer version of libfabric, though there haven't been updates to the psm provider in a while.  But if oneAPI is removing psm from its release, this will give you that option back.
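>
> For example, assuming the custom build is installed under /opt/libfabric (the path is just a placeholder):
>
>     export I_MPI_OFI_LIBRARY_INTERNAL=0
>     export LD_LIBRARY_PATH=/opt/libfabric/lib:$LD_LIBRARY_PATH
>     export FI_PROVIDER=psm    # restrict libfabric to the psm provider
>
> With the external library in use, the "libfabric version" line in the I_MPI_DEBUG output should no longer carry the "-impi" suffix.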
>
> - Sean
>
>
>> We have a Linux CentOS 7.9 cluster with an Intel True Scale Fabric Edge
>> Managed Switch (QDR InfiniBand), so our InfiniBand devices are qib0.
>>
>> $ ibstat
>> CA 'qib0'
>>       CA type: InfiniPath_QLE7340
>>       Number of ports: 1
>>       Firmware version:
>>       Hardware version: 2
>>       Node GUID: 0x00117500006f7990
>>       System image GUID: 0x00117500006f7990
>>
>>
>> We installed the Intel oneAPI Toolkits 2021 with libfabric 1.13, but we
>> are not getting the expected scalability; the MPI inter-node
>> communication is very slow.
>>
>>
>>    When I run fi_info --list I get:
>>
>> $ fi_info --list
>> psm2:
>>       version: 113.0
>> psm3:
>>       version: 1101.0
>> ofi_rxm:
>>       version: 113.0
>> verbs:
>>       version: 113.0
>> tcp:
>>       version: 113.0
>> sockets:
>>       version: 113.0
>> shm:
>>       version: 113.0
>> ofi_hook_noop:
>>       version: 113.0
>>
>>
>> I know that when running MPI programs on a cluster with Intel True
>> Scale HCAs it is recommended to use the PSM interface, but the MPI
>> startup is picking the tcp;ofi_rxm provider by default, and the
>> resulting MPI program runs very slowly!
>>
>> [0] MPI startup(): libfabric version: 1.13.0-impi
>> [0] MPI startup(): libfabric provider: tcp;ofi_rxm
>> [0] MPI startup(): detected tcp;ofi_rxm provider, set device name to
>> "tcp-ofi-rxm"
>> [0] MPI startup(): max_ch4_vcis: 1, max_reg_eps 64, enable_sep 0,
>> enable_shared_ctxs 0, do_av_insert 1
>> [0] MPI startup(): addrnamelen: 16
>> [0] MPI startup(): File
>> "/NFS/opt/intel/oneapi/mpi/2021.4.0/etc/tuning_skx_shm-ofi_tcp-ofi-rxm_40.dat"
>> not found
>> [0] MPI startup(): Load tuning file:
>> "/NFS/opt/intel/oneapi/mpi/2021.4.0/etc/tuning_skx_shm-ofi_tcp-ofi-rxm.dat"
>> :
>> :
>>
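>> For reference, a simple way to quantify this per provider would be the
>> PingPong test from the Intel MPI Benchmarks that ship with oneAPI
>> (2 ranks, 1 per node), varying FI_PROVIDER over the providers that
>> fi_info lists above:
>>
>>     FI_PROVIDER=verbs mpirun -np 2 -ppn 1 IMB-MPI1 PingPong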
>>
>> Any suggestions to improve MPI inter-node communication?
>>
>> Any help would be appreciated; I don't have enough experience with
>> libfabric.
>>
>>
>> Thanks.
>>
>> - Julian.
>>
>> _______________________________________________
>> Libfabric-users mailing list
>> Libfabric-users at lists.openfabrics.org
>> https://lists.openfabrics.org/mailman/listinfo/libfabric-users

