[ofiwg] libfabric hangs on QEMU/KVM virtual cluster
Wilkes, John
John.Wilkes at amd.com
Mon Feb 5 14:21:46 PST 2018
Yes, I'm running over the sockets provider. I configured libfabric-1.5.3 with the default providers; udp and sockets are the only core providers built, plus the rxm and rxd utility providers, but I don't think those apply here.
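In case it's useful, here is a minimal sketch of how the available providers can be enumerated with fi_getinfo (just an illustration of the check; the fi_info utility reports the same information):

/* list_providers.c -- minimal sketch; build with: gcc list_providers.c -o list_providers -lfabric */
#include <stdio.h>
#include <rdma/fabric.h>
#include <rdma/fi_errno.h>

int main(void)
{
    struct fi_info *info, *cur;

    /* NULL hints: ask libfabric for everything it can offer on this node */
    int ret = fi_getinfo(FI_VERSION(1, 5), NULL, NULL, 0, NULL, &info);
    if (ret) {
        fprintf(stderr, "fi_getinfo: %s\n", fi_strerror(-ret));
        return 1;
    }

    /* One fi_info entry per provider/fabric/endpoint combination */
    for (cur = info; cur; cur = cur->next)
        printf("provider: %s, fabric: %s\n",
               cur->fabric_attr->prov_name, cur->fabric_attr->name);

    fi_freeinfo(info);
    return 0;
}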
FWIW, I saw the same hang with 1.3.0 and 1.4.2, and I see the same hang with Open MPI and libfabric on QEMU (though I haven't looked into the Open MPI case in as much detail).
It shouldn't matter, but I'm running QEMU/KVM on an AMD box, so there could be some hidden Intel-ism that's causing the problem. (My latent paranoia is showing...)
Thanks!
John
-----Original Message-----
From: Hefty, Sean [mailto:sean.hefty at intel.com]
Sent: Monday, February 05, 2018 2:15 PM
To: Wilkes, John <John.Wilkes at amd.com>; libfabric-users at lists.openfabrics.org; ofiwg at lists.openfabrics.org
Subject: RE: libfabric hangs on QEMU/KVM virtual cluster
copying ofiwg mailing list as well
Are you running over the socket provider?
I'm not aware of any issues running over QEMU, but I don't know of anyone who has tested it. I'll check on the testing with MPICH to see what has been covered and how recently it was run.
- Sean
> I have a four-node cluster of QEMU/KVM virtual machines. I installed
> MPICH-3.2 and ran the mpi-hello-world program with no problem.
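> (For reference, mpi-hello-world is essentially the textbook MPI hello
> world; the listing below is a from-memory sketch rather than the exact
> source I built:)
>
> /* mpi-hello-world.c -- sketch; build with: mpicc mpi-hello-world.c -o mpi-hello-world */
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     int rank, size, name_len;
>     char name[MPI_MAX_PROCESSOR_NAME];
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     MPI_Get_processor_name(name, &name_len);
>
>     printf("Hello world from %s, rank %d of %d\n", name, rank, size);
>
>     MPI_Finalize();   /* the hang described below happens here */
>     return 0;
> }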
>
> I installed libfabric-1.5.3 and ran fabtests-1.5.3:
>
> $ $PWD/runfabtests.sh -p /nfs/fabtests/bin sockets 192.168.100.201 192.168.100.203
>
> And all tests pass:
>
> # --------------------------------------------------------------
> # Total Pass 73
> # Total Notrun 0
> # Total Fail 0
> # Percentage of Pass 100
> # --------------------------------------------------------------
>
> I rebuilt MPICH after configuring it to use libfabric. I recompiled
> the mpi-hello-world program. When I run mpi-hello-world with
> libfabric, it prints the "hello" message from all four nodes but hangs
> in MPI_Finalize.
>
> I rebuilt libfabric and MPICH with debugging enabled and generated a
> log file when running mpi-hello-world on just two nodes (i.e. using
> "-n 2" instead of "-n 4"). The log file shows that it is stuck at
> "Waiting for 1 close operations", repeating "MPID_nem_ofi_poll" over
> and over until I stop the program with control-C:
>
> ...
> <"MPID_nem_ofi_poll"(3e-06) src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_progress.c[124]
> >"MPID_nem_ofi_poll" src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_progress.c[45]
> >"MPID_nem_ofi_cts_send_callback" src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_cm.c[188]
> >"MPID_nem_ofi_handle_packet" src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_cm.c[167]
> <"MPID_nem_ofi_handle_packet"(3e-06) src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_cm.c[175]
> <"MPID_nem_ofi_cts_send_callback"(9e-06) src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_cm.c[191]
> >"MPID_nem_ofi_data_callback" src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_msg.c[124]
> <"MPID_nem_ofi_data_callback"(3e-06) src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_msg.c[173]
> <"MPID_nem_ofi_poll"(0.00404) src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_progress.c[124]
> <MPIDI_CH3I_PROGRESS(0.00796) src/mpid/ch3/channels/nemesis/src/ch3_progress.c[659]
> Waiting for 1 close operations src/mpid/ch3/src/ch3u_handle_connection.c[382]
> >MPIDI_CH3I_PROGRESS src/mpid/ch3/channels/nemesis/src/ch3_progress.c[424]
> >"MPID_nem_ofi_poll" src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_progress.c[45]
> <"MPID_nem_ofi_poll"(3e-06) src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_progress.c[124]
> ...
>
> I get the same behavior with Open MPI; mpi-hello-world prints the
> "hello" message from all four nodes and hangs. Without libfabric, it
> runs normally.
>
> Is there a known issue with libfabric on a QEMU/KVM virtual cluster?
> It seems like this should work.
>
> --
>
> John Wilkes | AMD Research | john.wilkes at amd.com | office: +1 425.586.6412 (x26412)