[ofiwg] libfabric hangs on QEMU/KVM virtual cluster

Wilkes, John John.Wilkes at amd.com
Mon Feb 5 16:50:43 PST 2018


I've run it with libfabric-1.5.3, which I think is the latest, and mpich-3.2. There is an mpich-3.2.1 now. I also tried OpenMPI-3.0.0, and I see there's OpenMPI-3.0.0-1 and 3.0.1rc2 available.

I'll grab the very latest MPICH and give it a try.

John

-----Original Message-----
From: Hefty, Sean [mailto:sean.hefty at intel.com] 
Sent: Monday, February 05, 2018 4:27 PM
To: Wilkes, John <John.Wilkes at amd.com>; libfabric-users at lists.openfabrics.org; ofiwg at lists.openfabrics.org
Subject: RE: libfabric hangs on QEMU/KVM virtual cluster

I don't have our test configuration handy, but we do run MPICH over libfabric sockets across 2 nodes as part of our CI testing.  This is against the tip of both trees, but no problems have been reported.

Are you able to try with later versions of either codebase?  I can't think of why libfabric master versus 1.5.3 would matter, but maybe MPICH has had some fixes.


> It's a four node cluster of QEMU/KVM VMs, each running Ubuntu 16.04 
> with kernel 4.4.0-112, x86_64. Node1 is a NFS server, and nodes 2, 3, 
> and 4 mount /nfs. The libfabric, fabtests, and mpich binaries are all 
> on /nfs.
> 
> Without libfabric:
> 
> $ /nfs/mpich3/bin/mpirun -f /nfs/hosts -n 4 /nfs/mpitests/mpi_hello_world.exe
> Hello world from processor node1, rank 0 out of 4 processors
> Hello world from processor node3, rank 2 out of 4 processors
> Hello world from processor node2, rank 1 out of 4 processors
> Hello world from processor node4, rank 3 out of 4 processors
> $
> 
> With libfabric:
> 
> $ /nfs/mpich3/bin/mpirun -f /nfs/hosts -n 4 /nfs/mpitests/mpi_hello_world.exe
> Hello world from processor node3, rank 2 out of 4 processors
> Hello world from processor node4, rank 3 out of 4 processors
> Hello world from processor node1, rank 0 out of 4 processors
> Hello world from processor node2, rank 1 out of 4 processors
> ^C[mpiexec@node1] Sending Ctrl-C to processes as requested
> [mpiexec@node1] Press Ctrl-C again to force abort
> $
> 
> John
> 
> -----Original Message-----
> From: Hefty, Sean [mailto:sean.hefty at intel.com]
> Sent: Monday, February 05, 2018 3:24 PM
> To: Wilkes, John <John.Wilkes at amd.com>; libfabric-users at lists.openfabrics.org; ofiwg at lists.openfabrics.org
> Subject: RE: libfabric hangs on QEMU/KVM virtual cluster
> 
> > Yes, running over the socket provider. I configured libfabric-1.5.3 
> > with default providers; udp and socket are the only ones - plus rxm 
> > and rxd, but I don't think they apply.
> >
> > FWIW, I saw the same hang with 1.3.0 and 1.4.2, and I see the same
> > hang with OpenVPN and libfabric on QEMU (though I haven't looked into
> > OpenVPN in as much detail).
> >
> > It shouldn't matter, but I'm running QEMU/KVM on an AMD box, so there
> > could be some hidden Intel-ism that's causing the problem. (My latent
> > paranoia is showing...)
> 
> The socket provider is standard BSD sockets, without any CPU specific 
> code.  That will change in v1.6.0 in order to add CPU specific 
> instructions to handle persistent memory.  But the code should still 
> work fine across any supported platform.  I'm just limited on my 
> testing environment.
> 
> Is the VM 32-bit or 64-bit?
> 
> - Sean


