[ofiwg] libfabric hangs on QEMU/KVM virtual cluster
Wilkes, John
John.Wilkes at amd.com
Mon Feb 5 14:21:46 PST 2018
Yes, I'm running over the sockets provider. I configured libfabric-1.5.3 with the default providers; udp and sockets are the only core providers built, plus the rxm and rxd utility providers, but I don't think those apply here.
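In case it's useful, here is a minimal sketch of how the available providers can be enumerated with fi_getinfo (just an illustration of the check; the fi_info utility reports the same information):

/* list_providers.c -- minimal sketch; build with: gcc list_providers.c -o list_providers -lfabric */
#include <stdio.h>
#include <rdma/fabric.h>
#include <rdma/fi_errno.h>

int main(void)
{
    struct fi_info *info, *cur;

    /* NULL hints: ask libfabric for everything it can offer on this node */
    int ret = fi_getinfo(FI_VERSION(1, 5), NULL, NULL, 0, NULL, &info);
    if (ret) {
        fprintf(stderr, "fi_getinfo: %s\n", fi_strerror(-ret));
        return 1;
    }

    /* One fi_info entry per provider/fabric/endpoint combination */
    for (cur = info; cur; cur = cur->next)
        printf("provider: %s, fabric: %s\n",
               cur->fabric_attr->prov_name, cur->fabric_attr->name);

    fi_freeinfo(info);
    return 0;
}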
FWIW, I saw the same hang with 1.3.0 and 1.4.2, and I see the same hang with Open MPI and libfabric on QEMU (though I haven't looked into the Open MPI case in as much detail).
It shouldn't matter, but I'm running QEMU/KVM on an AMD box, so there could be some hidden Intel-ism that's causing the problem. (My latent paranoia is showing...)
Thanks!
John
-----Original Message-----
From: Hefty, Sean [mailto:sean.hefty at intel.com]
Sent: Monday, February 05, 2018 2:15 PM
To: Wilkes, John <John.Wilkes at amd.com>; libfabric-users at lists.openfabrics.org; ofiwg at lists.openfabrics.org
Subject: RE: libfabric hangs on QEMU/KVM virtual cluster
copying ofiwg mailing list as well
Are you running over the socket provider?
I'm not aware of any issues running over QEMU, but I don't know of anyone who has tested it. I'll check on the testing with MPICH to see what has been covered and how recently it was run.
- Sean
> I have a four-node cluster of QEMU/KVM virtual machines. I installed
> MPICH-3.2 and ran the mpi-hello-world program with no problem.
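> (For reference, mpi-hello-world is essentially the textbook MPI hello
> world; the listing below is a from-memory sketch rather than the exact
> source I built:)
>
> /* mpi-hello-world.c -- sketch; build with: mpicc mpi-hello-world.c -o mpi-hello-world */
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     int rank, size, name_len;
>     char name[MPI_MAX_PROCESSOR_NAME];
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     MPI_Get_processor_name(name, &name_len);
>
>     printf("Hello world from %s, rank %d of %d\n", name, rank, size);
>
>     MPI_Finalize();   /* the hang described below happens here */
>     return 0;
> }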
>
> I installed libfabric-1.5.3 and ran fabtests-1.5.3:
>
> $ $PWD/runfabtests.sh -p /nfs/fabtests/bin sockets 192.168.100.201 192.168.100.203
>
> And all tests pass:
>
> # --------------------------------------------------------------
> # Total Pass 73
> # Total Notrun 0
> # Total Fail 0
> # Percentage of Pass 100
> # --------------------------------------------------------------
>
> I rebuilt MPICH after configuring it to use libfabric. I recompiled
> the mpi-hello-world program. When I run mpi-hello-world with
> libfabric, it prints the "hello" message from all four nodes but hangs
> in MPI_Finalize.
>
> I rebuilt libfabric and MPICH with debugging enabled and generated a
> log file when running mpi-hello-world on just two nodes (i.e. using
> "-n 2" instead of "-n 4"). The log file shows that it is stuck at
> "Waiting for 1 close operations", repeating "MPID_nem_ofi_poll" over
> and over until I stop the program with control-C:
>
> ...
> <"MPID_nem_ofi_poll"(3e-06) src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_progress.c[124]
> >"MPID_nem_ofi_poll" src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_progress.c[45]
> >"MPID_nem_ofi_cts_send_callback" src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_cm.c[188]
> >"MPID_nem_ofi_handle_packet" src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_cm.c[167]
> <"MPID_nem_ofi_handle_packet"(3e-06) src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_cm.c[175]
> <"MPID_nem_ofi_cts_send_callback"(9e-06) src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_cm.c[191]
> >"MPID_nem_ofi_data_callback" src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_msg.c[124]
> <"MPID_nem_ofi_data_callback"(3e-06) src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_msg.c[173]
> <"MPID_nem_ofi_poll"(0.00404) src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_progress.c[124]
> <MPIDI_CH3I_PROGRESS(0.00796) src/mpid/ch3/channels/nemesis/src/ch3_progress.c[659]
> Waiting for 1 close operations src/mpid/ch3/src/ch3u_handle_connection.c[382]
> >MPIDI_CH3I_PROGRESS src/mpid/ch3/channels/nemesis/src/ch3_progress.c[424]
> >"MPID_nem_ofi_poll" src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_progress.c[45]
> <"MPID_nem_ofi_poll"(3e-06) src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_progress.c[124]
> ...
>
> I get the same behavior with Open MPI; mpi-hello-world prints the
> "hello" message from all four nodes and hangs. Without libfabric, it
> runs normally.
>
> Is there a known issue with libfabric on a QEMU/KVM virtual cluster?
> It seems like this should work.
>
> --
>
> John Wilkes | AMD Research | john.wilkes at amd.com | office: +1 425.586.6412 (x26412)