[libfabric-users] MPI_Barrier hang with sockets provider
Wilkes, John
John.Wilkes at amd.com
Wed Mar 20 08:16:17 PDT 2019
The XSBench proxy app finishes successfully on 4 nodes, but it hangs when I run it with the libfabric sockets provider. The hang occurs after the simulation is complete, in the closing calls to MPI_Barrier(), MPI_Reduce(), and MPI_Finalize().
Command line:
$ mpirun -np 4 --map-by node --hostfile /nfs/mpi/etc/mpi-hostfile --mca mtl_ofi_provider_include sockets /nfs/software/proxy_apps/XSBench-14/src/XSBench -t 1 -s small
XSBench with the sockets provider runs to completion (does not hang) with -np 3.
OpenMPI-4.0.0
$ ./configure --prefix=/nfs/mpi --with-libfabric=/nfs/mpi --enable-orterun-prefix-by-default --disable-verbs-sshmem --without-verbs --enable-debug CFLAGS="-I/nfs/mpi/include -g -L/nfs/mpi/lib -ggdb -O0"
libfabric-1.7.0
$ ./configure --prefix=/nfs/mpi --enable-sockets=yes --enable-verbs=no --enable-debug=yes CFLAGS="-ggdb -O0"
A gdb stack trace on each node shows that node0 (where mpirun was run) is stuck in MPI_Reduce(). Node1 and node2 are in MPI_Finalize(), and node3 is in MPI_Barrier(). This is one example; the node that hangs in MPI_Barrier varies from run to run.
Node3 stack trace:
#0 fi_gettime_ms
#1 sock_cq_sreadfrom
#2 sock_cq_readfrom
#3 sock_cq_read
#4 fi_cq_read
#5 ompi_mtl_ofi_progress
#6 ompi_mtl_ofi_progress_no_inline
#7 opal_progress
#8 ompi_request_wait_completion
#9 ompi_request_default_wait
#10 ompi_coll_base_sendrecv_zero
#11 ompi_coll_base_barrier_intra_recursivedoubling
#12 ompi_coll_tuned_barrier_intra_dec_fixed
#13 PMPI_Barrier
#14 print_results
#15 main
Node0 stack trace:
#0 ??
#1 gettimeofday
#2 fi_gettime_ms
#3 sock_cq_sreadfrom
#4 sock_cq_readfrom
#5 sock_cq_read
#6 fi_cq_read
#7 ompi_mtl_ofi_progress
#8 ompi_mtl_ofi_progress_no_inline
#9 opal_progress
#10 ompi_request_wait_completion
#11 mca_pml_cm_recv
#12 ompi_coll_base_reduce_intra_basic_linear
#13 ompi_coll_tuned_reduce_intra_dec_fixed
#14 PMPI_Reduce
#15 print_results
#16 main
Node1 stack trace:
#0 __GI___nanosleep
#1 usleep
#2 ompi_mpi_finalize
#3 PMPI_Finalize
#4 main
Node2 stack trace:
#0 __GI___nanosleep
#1 usleep
#2 ompi_mpi_finalize
#3 PMPI_Finalize
#4 main
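For readers following the Node3 trace: it passes through ompi_coll_base_barrier_intra_recursivedoubling and ompi_coll_base_sendrecv_zero. The sketch below shows the textbook recursive-doubling pairing for a power-of-two rank count, where each rank exchanges a zero-byte message with the rank whose number differs in one bit. It is an illustration only and is not taken from the Open MPI source; the helper names (is_pow2, rd_rounds, rd_partner, rd_print_schedule) and the RD_SKETCH_MAIN macro are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Nonzero when n is a positive power of two. */
static int is_pow2(int n)
{
    return n > 0 && (n & (n - 1)) == 0;
}

/* Number of pairwise exchange rounds for `size` ranks (size a power of two). */
static int rd_rounds(int size)
{
    int rounds = 0;
    for (int mask = 1; mask < size; mask <<= 1)
        rounds++;
    return rounds;
}

/* Partner of `rank` in round `round` (0-based): flip one bit of the rank. */
static int rd_partner(int rank, int round)
{
    assert(rank >= 0 && round >= 0);
    return rank ^ (1 << round);
}

/* Print the exchange schedule; only power-of-two sizes are covered here. */
static void rd_print_schedule(int size)
{
    if (!is_pow2(size)) {
        printf("size %d is not a power of two; not covered by this sketch\n", size);
        return;
    }
    for (int round = 0; round < rd_rounds(size); round++)
        for (int rank = 0; rank < size; rank++)
            printf("round %d: rank %d <-> rank %d\n",
                   round, rank, rd_partner(rank, round));
}

#ifdef RD_SKETCH_MAIN
/* Standalone demo: build with -DRD_SKETCH_MAIN to print the 4-rank schedule. */
int main(void)
{
    rd_print_schedule(4);
    return 0;
}
#endif
```

Under this pattern, with 4 ranks, rank 3 pairs with rank 2 in round 0 and with rank 1 in round 1. Assuming node N hosts rank N (not confirmed by the traces), the rank still in MPI_Barrier() would be waiting on a partner that has already moved on to MPI_Finalize(), which would point to a message that was sent but never delivered. Since 3 is not a power of two, it may also be worth checking whether the -np 3 run selects a different barrier algorithm; that is a guess, not something the traces show.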
--
John Wilkes | AMD Research | john.wilkes at amd.com | office: +1 425.586.6412 (x26412)