[libfabric-users] MPI_Barrier hang with sockets provider

Ilango, Arun arun.ilango at intel.com
Thu Mar 21 11:23:23 PDT 2019


Hi John,

I'm not sure what's going wrong with the sockets provider, but can you try your test with the "tcp;ofi_rxm" provider stack? The sockets provider is currently in maintenance mode, and "tcp;ofi_rxm" is the preferred way to run apps over TCP sockets. Please let me know if you face any issues.
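For example, reusing your mpirun command line, something like the following should select the layered provider (the quotes matter, since ";" is special to the shell); alternatively, exporting FI_PROVIDER="tcp;ofi_rxm" in each rank's environment should have the same effect:

$ mpirun -np 4 --map-by node --hostfile /nfs/mpi/etc/mpi-hostfile --mca mtl_ofi_provider_include "tcp;ofi_rxm" /nfs/software/proxy_apps/XSBench-14/src/XSBench -t 1 -s small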

Your configure and mpirun command lines seem fine to me.

> Is there a basic "how to" for configuring and running MPI with
> the libfabric sockets provider? Getting started with libfabric is not easy!

If one isn't available already, adding a README for running MPI with libfabric would be a good idea; let me check. In the meantime, there are various resources linked from the README on the libfabric GitHub page (https://github.com/ofiwg/libfabric). Are there any other issues you've run into while getting started with libfabric?
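One quick sanity check that a given provider is built and usable on your nodes is the fi_info utility that ships with libfabric, e.g. (assuming the libfabric bin directory, /nfs/mpi/bin in your install, is on your PATH; the layered name may need quoting):

$ fi_info -p sockets
$ fi_info -p "tcp;ofi_rxm"

If the provider is usable, fi_info prints one or more fabric/domain entries; otherwise it reports an error from fi_getinfo.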

Thanks,
Arun.

From: Libfabric-users [mailto:libfabric-users-bounces at lists.openfabrics.org] On Behalf Of Wilkes, John
Sent: Thursday, March 21, 2019 7:03 AM
To: libfabric-users at lists.openfabrics.org
Subject: Re: [libfabric-users] MPI_Barrier hang with sockets provider

I have hardly any experience with libfabric, and I would greatly appreciate some suggestions!

Am I not configuring libfabric properly? Am I missing something on the mpirun command line?

Is there a basic "how to" for configuring and running MPI with the libfabric sockets provider? Getting started with libfabric is not easy!

John

From: Wilkes, John
Sent: Wednesday, March 20, 2019 8:16 AM
To: libfabric-users at lists.openfabrics.org
Cc: Wilkes, John <John.Wilkes at amd.com>
Subject: MPI_Barrier hang with sockets provider

When I run the XSBench proxy app on 4 nodes it finishes successfully, but when I run it with the libfabric sockets provider it hangs. After the simulation is complete, the app calls MPI_Barrier(), MPI_Reduce(), and MPI_Finalize().

command line:
$ mpirun -np 4 --map-by node --hostfile /nfs/mpi/etc/mpi-hostfile --mca mtl_ofi_provider_include sockets /nfs/software/proxy_apps/XSBench-14/src/XSBench -t 1 -s small

XSBench with the sockets provider runs to completion (does not hang) with -np 3.

OpenMPI-4.0.0
$ ./configure --prefix=/nfs/mpi --with-libfabric=/nfs/mpi --enable-orterun-prefix-by-default --disable-verbs-sshmem --without-verbs --enable-debug CFLAGS="-I/nfs/mpi/include -g -L/nfs/mpi/lib -ggdb -O0"

libfabric-1.7.0
$ ./configure --prefix=/nfs/mpi --enable-sockets=yes --enable-verbs=no --enable-debug=yes CFLAGS="-ggdb -O0"

A gdb stack trace on each node shows node0 (where mpirun was run) stuck in MPI_Reduce(), node1 and node2 in MPI_Finalize(), and node3 in MPI_Barrier(). This is one example; the node that hangs in MPI_Barrier varies from run to run.
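(For reference, a trace like the ones below can be captured by attaching gdb to a rank's process, something along the lines of the following, where <pid> is that rank's process id:

$ gdb -p <pid> -batch -ex "bt"
)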

Node3 stack trace:

#0 fi_gettime_ms
#1 sock_cq_sreadfrom
#2 sock_cq_readfrom
#3 sock_cq_read
#4 fi_cq_read
#5 ompi_mtl_ofi_progress
#6 ompi_mtl_ofi_progress_no_inline
#7 opal_progress
#8 ompi_request_wait_completion
#9 ompi_request_default_wait
#10 ompi_coll_base_sendrecv_zero
#11 ompi_coll_base_barrier_intra_recursivedoubling
#12 ompi_coll_tuned_barrier_intra_dec_fixed
#13 PMPI_Barrier
#14 print_results
#15 main

Node0 stack trace:

#0 ??
#1 gettimeofday
#2 fi_gettime_ms
#3 sock_cq_sreadfrom
#4 sock_cq_readfrom
#5 sock_cq_read
#6 fi_cq_read
#7 ompi_mtl_ofi_progress
#8 ompi_mtl_ofi_progress_no_inline
#9 opal_progress
#10 ompi_request_wait_completion
#11 mca_pml_cm_recv
#12 ompi_coll_base_reduce_intra_basic_linear
#13 ompi_coll_tuned_reduce_intra_dec_fixed
#14 PMPI_Reduce
#15 print_results
#16 main

Node1 stack trace:
#0 __GI___nanosleep
#1 usleep
#2 ompi_mpi_finalize
#3 PMPI_Finalize
#4 main

Node2 stack trace:
#0 __GI___nanosleep
#1 usleep
#2 ompi_mpi_finalize
#3 PMPI_Finalize
#4 main

--
John Wilkes | AMD Research | john.wilkes at amd.com | office: +1 425.586.6412 (x26412)
