[ofiwg] libfabric hangs on QEMU/KVM virtual cluster

Wilkes, John John.Wilkes at amd.com
Mon Feb 5 15:40:30 PST 2018


It's a four-node cluster of QEMU/KVM VMs, each running 64-bit (x86_64) Ubuntu 16.04 with kernel 4.4.0-112. Node1 is an NFS server, and nodes 2, 3, and 4 mount /nfs. The libfabric, fabtests, and mpich binaries are all on /nfs.
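
The word size can be double-checked on each node with a few lines of shell. This is only a minimal sketch, assuming a POSIX shell with uname and getconf available; the variable names are illustrative and not taken from any script in this setup.

```shell
#!/bin/sh
# Report the machine architecture and userspace word size of this node.
arch=$(uname -m)          # kernel-reported machine type, e.g. x86_64
bits=$(getconf LONG_BIT)  # userspace word size in bits
node=$(uname -n)          # node name as reported by the kernel
echo "$node: arch=$arch bits=$bits"
```

Running it once per node (for example over ssh) should show the same values everywhere if the VMs were built from the same image.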

Without libfabric:

$ /nfs/mpich3/bin/mpirun -f /nfs/hosts -n 4 /nfs/mpitests/mpi_hello_world.exe
Hello world from processor node1, rank 0 out of 4 processors
Hello world from processor node3, rank 2 out of 4 processors
Hello world from processor node2, rank 1 out of 4 processors
Hello world from processor node4, rank 3 out of 4 processors
$ 

With libfabric:

$ /nfs/mpich3/bin/mpirun -f /nfs/hosts -n 4 /nfs/mpitests/mpi_hello_world.exe 
Hello world from processor node3, rank 2 out of 4 processors
Hello world from processor node4, rank 3 out of 4 processors
Hello world from processor node1, rank 0 out of 4 processors
Hello world from processor node2, rank 1 out of 4 processors
^C[mpiexec@node1] Sending Ctrl-C to processes as requested
[mpiexec@node1] Press Ctrl-C again to force abort
$ 

John

-----Original Message-----
From: Hefty, Sean [mailto:sean.hefty at intel.com] 
Sent: Monday, February 05, 2018 3:24 PM
To: Wilkes, John <John.Wilkes at amd.com>; libfabric-users at lists.openfabrics.org; ofiwg at lists.openfabrics.org
Subject: RE: libfabric hangs on QEMU/KVM virtual cluster

> Yes, running over the socket provider. I configured libfabric-1.5.3 
> with default providers; udp and socket are the only ones - plus rxm 
> and rxd, but I don't think they apply.
> 
> FWIW, I saw the same hang with 1.3.0 and 1.4.2, and I see the same 
> hang with OpenVPN and libfabric on QEMU (though I haven't looked into 
> OpenVPN in as much detail).
> 
> It shouldn't matter, but I'm running QEMU/KVM on an AMD box, so there 
> could be some hidden Intel-ism that's causing the problem. (My latent 
> paranoia is showing...)

The socket provider is standard BSD sockets, without any CPU-specific code. That will change in v1.6.0 in order to add CPU-specific instructions to handle persistent memory. But the code should still work fine across any supported platform. I'm just limited in my testing environment.

Is the VM 32-bit or 64-bit?

- Sean


