[libfabric-users] libfabric hangs on QEMU/KVM virtual cluster
Wilkes, John
John.Wilkes at amd.com
Mon Feb 5 13:18:10 PST 2018
I have a four node cluster of QEMU/KVM virtual machines. I installed MPICH-3.2 and ran the mpi-hello-world program with no problem.
I installed libfabric-1.5.3 and ran fabtests-1.5.3:
$ $PWD/runfabtests.sh -p /nfs/fabtests/bin sockets 192.168.100.201 192.168.100.203
And all tests pass:
# --------------------------------------------------------------
# Total Pass 73
# Total Notrun 0
# Total Fail 0
# Percentage of Pass 100
# --------------------------------------------------------------
I then rebuilt MPICH configured to use libfabric and recompiled the mpi-hello-world program. When I run mpi-hello-world with libfabric, it prints the "hello" message from all four nodes but then hangs in MPI_Finalize.
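For reference, the rebuild followed this shape; the install prefixes are placeholders from my setup, but --with-device=ch3:nemesis:ofi and --with-libfabric are the MPICH-3.2 configure options for the OFI netmod:

```shell
# Sketch of the MPICH-3.2 rebuild against libfabric-1.5.3.
# /nfs/libfabric and /nfs/mpich are hypothetical install prefixes.
./configure --prefix=/nfs/mpich \
            --with-device=ch3:nemesis:ofi \
            --with-libfabric=/nfs/libfabric
make && make install

# Recompile the test program with the rebuilt wrapper compiler:
/nfs/mpich/bin/mpicc -o mpi-hello-world mpi-hello-world.c
```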
I rebuilt libfabric and MPICH with debugging enabled and captured a log file while running mpi-hello-world on just two nodes (i.e. using "-n 2" instead of "-n 4"). The log shows it stuck at "Waiting for 1 close operations", repeating "MPID_nem_ofi_poll" over and over until I interrupt the program with Ctrl-C:
...
<"MPID_nem_ofi_poll"(3e-06) src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_progress.c[124]
>"MPID_nem_ofi_poll" src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_progress.c[45]
>"MPID_nem_ofi_cts_send_callback" src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_cm.c[188]
>"MPID_nem_ofi_handle_packet" src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_cm.c[167]
<"MPID_nem_ofi_handle_packet"(3e-06) src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_cm.c[175]
<"MPID_nem_ofi_cts_send_callback"(9e-06) src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_cm.c[191]
>"MPID_nem_ofi_data_callback" src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_msg.c[124]
<"MPID_nem_ofi_data_callback"(3e-06) src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_msg.c[173]
<"MPID_nem_ofi_poll"(0.00404) src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_progress.c[124]
<MPIDI_CH3I_PROGRESS(0.00796) src/mpid/ch3/channels/nemesis/src/ch3_progress.c[659]
Waiting for 1 close operations src/mpid/ch3/src/ch3u_handle_connection.c[382]
>MPIDI_CH3I_PROGRESS src/mpid/ch3/channels/nemesis/src/ch3_progress.c[424]
>"MPID_nem_ofi_poll" src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_progress.c[45]
<"MPID_nem_ofi_poll"(3e-06) src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_progress.c[124]
...
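For completeness, the trace above came from settings along these lines. FI_LOG_LEVEL is libfabric's documented log-verbosity variable; the MPICH_DBG_* variables are my assumption about how a --enable-g=dbg,log build is driven:

```shell
# libfabric runtime logging (documented libfabric environment variable):
export FI_LOG_LEVEL=debug

# MPICH internal tracing; these take effect only in a build configured
# with --enable-g=dbg,log (assumption about the debug build used here):
export MPICH_DBG_LEVEL=VERBOSE
export MPICH_DBG_CLASS=ALL

# The two-node run that produced the trace (host IPs assumed to match
# the runfabtests.sh invocation above):
# mpiexec -n 2 -hosts 192.168.100.201,192.168.100.203 ./mpi-hello-world
```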
I get the same behavior with OpenVPN; mpi-hello-world prints the "hello" message from all four nodes and hangs. Without libfabric, it runs normally.
Is there a known issue with libfabric on a QEMU/KVM virtual cluster? It seems like this should work.
--
John Wilkes | AMD Research | john.wilkes at amd.com<mailto:john.wilkes at amd.com> | office: +1 425.586.6412 (x26412)