[ofiwg] libfabric hangs on QEMU/KVM virtual cluster

Hefty, Sean sean.hefty at intel.com
Mon Feb 5 14:15:23 PST 2018


Copying the ofiwg mailing list as well.

Are you running over the sockets provider?

I'm not aware of any issues running over QEMU, but I don't know of anyone who has tested it. I'll check what MPICH testing has been done over libfabric and how recently it was run.
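
One way to settle the provider question is to pin the provider and turn on libfabric logging through the environment before launching the job. Below is a minimal Python sketch, not a tested recipe. It assumes the launcher is mpiexec and that it forwards the caller's environment to the ranks (check your launcher's options). The helper names and the ./mpi-hello-world path are hypothetical; FI_PROVIDER and FI_LOG_LEVEL are libfabric's environment variables.

```python
import os
import subprocess


def libfabric_debug_env(provider="sockets", log_level="debug", base=None):
    """Return a copy of the environment with libfabric settings added.

    FI_PROVIDER restricts libfabric to the named provider and
    FI_LOG_LEVEL controls how verbose its logging is.
    """
    env = dict(os.environ if base is None else base)
    env["FI_PROVIDER"] = provider
    env["FI_LOG_LEVEL"] = log_level
    return env


def hello_world_command(nprocs=2, program="./mpi-hello-world"):
    """Build the launcher command line (program path is a placeholder)."""
    return ["mpiexec", "-n", str(nprocs), program]


def run_with_timeout(nprocs=2, seconds=60):
    """Run the job; subprocess.TimeoutExpired is raised if it hangs."""
    return subprocess.run(
        hello_world_command(nprocs),
        env=libfabric_debug_env(),
        timeout=seconds,
    )
```

A timeout turns the hang in MPI_Finalize into an exception, so the log can be collected without pressing control-C.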

- Sean


> I have a four node cluster of QEMU/KVM virtual machines. I installed
> MPICH-3.2 and ran the mpi-hello-world program with no problem.
> 
> 
> 
> I installed libfabric-1.5.3 and ran fabtests-1.5.3:
> 
> 
> 
> $ $PWD/runfabtests.sh -p /nfs/fabtests/bin sockets 192.168.100.201 192.168.100.203
> 
> 
> 
> And all tests pass:
> 
> 
> 
> # --------------------------------------------------------------
> # Total Pass                                                73
> # Total Notrun                                               0
> # Total Fail                                                 0
> # Percentage of Pass                                       100
> # --------------------------------------------------------------
> 
> 
> 
> I rebuilt MPICH after configuring it to use libfabric. I recompiled
> the mpi-hello-world program. When I run mpi-hello-world with
> libfabric, it prints the “hello” message from all four nodes but hangs
> in MPI_Finalize.
> 
> 
> 
> I rebuilt libfabric and MPICH with debugging enabled and generated a
> log file when running mpi-hello-world on just two nodes (i.e. using
> “-n 2” instead of “-n 4”). The log file indicates that it is stuck
> “Waiting for 1 close operations”, repeating “MPID_nem_ofi_poll” over
> and over until I stop the program with control-C:
> 
> ...
>   <"MPID_nem_ofi_poll"(3e-06) src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_progress.c[124]
>   >"MPID_nem_ofi_poll" src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_progress.c[45]
>    >"MPID_nem_ofi_cts_send_callback" src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_cm.c[188]
>     >"MPID_nem_ofi_handle_packet" src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_cm.c[167]
>     <"MPID_nem_ofi_handle_packet"(3e-06) src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_cm.c[175]
>    <"MPID_nem_ofi_cts_send_callback"(9e-06) src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_cm.c[191]
>    >"MPID_nem_ofi_data_callback" src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_msg.c[124]
>    <"MPID_nem_ofi_data_callback"(3e-06) src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_msg.c[173]
>   <"MPID_nem_ofi_poll"(0.00404) src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_progress.c[124]
> <MPIDI_CH3I_PROGRESS(0.00796) src/mpid/ch3/channels/nemesis/src/ch3_progress.c[659]
> Waiting for 1 close operations src/mpid/ch3/src/ch3u_handle_connection.c[382]
> >MPIDI_CH3I_PROGRESS src/mpid/ch3/channels/nemesis/src/ch3_progress.c[424]
>   >"MPID_nem_ofi_poll" src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_progress.c[45]
>   <"MPID_nem_ofi_poll"(3e-06) src/mpid/ch3/channels/nemesis/netmod/ofi/ofi_progress.c[124]
> ...
> 
> 
> 
> I get the same behavior with OpenVPN; mpi-hello-world prints the
> “hello” message from all four nodes and hangs. Without libfabric, it
> runs normally.
> 
> 
> 
> Is there a known issue with libfabric on a QEMU/KVM virtual cluster?
> It seems like this should work.
> 
> 
> 
> --
> 
> John Wilkes | AMD Research | john.wilkes at amd.com | office: +1 425.586.6412 (x26412)
> 
> 
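
The stall in the quoted trace can also be checked mechanically rather than by eye. Below is a rough Python sketch. It assumes the debug log is plain text with the email quote prefix removed, and that entry lines start with `>` and exit lines with `<` as in the excerpt. It treats a repeated, unchanging nonzero "Waiting for N close operations" count as a sign of the hang. The function names and the `min_repeats` threshold are illustrative and not part of MPICH or libfabric.

```python
import re
from collections import Counter

# Entry lines look like:  >"MPID_nem_ofi_poll" ...  or  >MPIDI_CH3I_PROGRESS ...
ENTER_RE = re.compile(r'^\s*>"?(\w+)"?')
# Progress message reported by ch3u_handle_connection.c in the excerpt.
WAIT_RE = re.compile(r"Waiting for (\d+) close operations")


def summarize_trace(lines):
    """Count function entries and collect the pending-close counts in order."""
    entries = Counter()
    pending = []
    for line in lines:
        wait = WAIT_RE.search(line)
        if wait:
            pending.append(int(wait.group(1)))
            continue
        enter = ENTER_RE.match(line)
        if enter:
            entries[enter.group(1)] += 1
    return entries, pending


def looks_stuck(pending, min_repeats=3):
    """Heuristic: the last min_repeats reports show the same nonzero count."""
    tail = pending[-min_repeats:]
    return len(tail) == min_repeats and tail[0] > 0 and len(set(tail)) == 1


# Lines taken from the quoted excerpt; the real log repeats this cycle.
CYCLE = [
    "<MPIDI_CH3I_PROGRESS(0.00796)",
    "Waiting for 1 close operations",
    ">MPIDI_CH3I_PROGRESS",
    '  >"MPID_nem_ofi_poll"',
    '  <"MPID_nem_ofi_poll"(3e-06)',
]
```

Calling `summarize_trace(open(path))` on the full log, with `path` pointing at wherever the debug output was written, and passing the second result to `looks_stuck` would confirm whether the close count ever drops.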



