[libfabric-users] libfabric hangs on QEMU/KVM virtual cluster

Wilkes, John John.Wilkes at amd.com
Mon Feb 5 13:18:10 PST 2018

I have a four node cluster of QEMU/KVM virtual machines. I installed MPICH-3.2 and ran the mpi-hello-world program with no problem.

I installed libfabric-1.5.3 and ran fabtests-1.5.3:

$ $PWD/runfabtests.sh -p /nfs/fabtests/bin sockets

And all tests pass:

# --------------------------------------------------------------
# Total Pass                                                73
# Total Notrun                                               0
# Total Fail                                                 0
# Percentage of Pass                                       100
# --------------------------------------------------------------

I rebuilt MPICH after configuring it to use libfabric. I recompiled the mpi-hello-world program. When I run mpi-hello-world with libfabric, it prints the "hello" message from all four nodes but hangs in MPI_Finalize.

I rebuilt libfabric and MPICH with debugging enabled and generated a log file while running mpi-hello-world on just two nodes (i.e., using "-n 2" instead of "-n 4"). The log shows the process stuck at "Waiting for 1 close operations": it calls "MPID_nem_ofi_poll" over and over, and the pending close count never drops, until I stop the program with Ctrl-C:

  <"MPID_nem_ofi_poll"(3e-06)         src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_progress.c[124]
  >"MPID_nem_ofi_poll"                src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_progress.c[45]
   >"MPID_nem_ofi_cts_send_callback"  src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_cm.c[188]
    >"MPID_nem_ofi_handle_packet"     src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_cm.c[167]
    <"MPID_nem_ofi_handle_packet"(3e-06) src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_cm.c[175]
   <"MPID_nem_ofi_cts_send_callback"(9e-06) src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_cm.c[191]
   >"MPID_nem_ofi_data_callback"      src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_msg.c[124]
   <"MPID_nem_ofi_data_callback"(3e-06) src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_msg.c[173]
  <"MPID_nem_ofi_poll"(0.00404)       src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_progress.c[124]
<MPIDI_CH3I_PROGRESS(0.00796)        src/mpid/ch3/channels/nemesis/src/ch3_progress.c[659]
Waiting for 1 close operations       src/mpid/ch3/src/ch3u_handle_connection.c[382]
>MPIDI_CH3I_PROGRESS                 src/mpid/ch3/channels/nemesis/src/ch3_progress.c[424]
  >"MPID_nem_ofi_poll"                src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_progress.c[45]
  <"MPID_nem_ofi_poll"(3e-06)         src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_progress.c[124]
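
One way to check whether the pending close count ever changes while the program spins is to tally the "Waiting for N close operations" lines in the debug log. The Python sketch below is not part of MPICH or libfabric. The function name summarize_close_waits and the log path given on the command line are hypothetical, and the script assumes the message format shown in the excerpt above.

```python
import re
import sys
from collections import Counter

# Assumed message format, taken from the log excerpt:
#   "Waiting for 1 close operations   src/mpid/ch3/src/ch3u_handle_connection.c[382]"
WAIT_RE = re.compile(r"Waiting for (\d+) close operations")


def summarize_close_waits(lines):
    """Map each pending-close count to the number of log lines reporting it."""
    counts = Counter()
    for line in lines:
        match = WAIT_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Hypothetical usage: python close_waits.py <path-to-debug-log>
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as log:
        for pending, seen in sorted(summarize_close_waits(log).items()):
            print(f"pending={pending}: reported {seen} time(s)")
```

If the only count reported is 1 and it repeats many times, the close handshake is never completing, which matches a hang in MPI_Finalize rather than slow progress.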

I get the same behavior with OpenVPN; mpi-hello-world prints the "hello" message from all four nodes and hangs. Without libfabric, it runs normally.

Is there a known issue with libfabric on a QEMU/KVM virtual cluster? It seems like this should work.

John Wilkes | AMD Research | john.wilkes at amd.com | office: +1 425.586.6412 (x26412)

