[Users] A simple MPI application hangs with MXM

Sébastien Boisvert Sebastien.Boisvert at clumeq.ca
Fri Sep 21 08:26:08 PDT 2012


Hello everyone,

We have an OFED cluster with a Mellanox interconnect.

Mellanox Messaging Accelerator (MXM) is a Mellanox product and Open-MPI supports
it too [ http://www.open-mpi.org//faq/?category=openfabrics#mxm ]

It includes

   "enhancements that significantly increase the scalability and performance 
   of message communications in the network, alleviating bottlenecks within 
   the parallel communication libraries."


Anyway, at our supercomputer center we installed mxm v1.1 (mxm_1.1.1341)
    [ http://mellanox.com/downloads/hpc/mxm/v1.1/mxm-latest.tar ]
and Open-MPI v1.6.1 [  http://www.open-mpi.org/software/ompi/v1.6/downloads/openmpi-1.6.1.tar.bz2 ]. 
We compiled all of this with gcc 4.6.1.
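
In case the build details matter, Open-MPI was configured against MXM roughly
like this (the prefixes below are placeholders, not our actual install paths):

   cd openmpi-1.6.1
   ./configure --prefix=/opt/openmpi-1.6.1 --with-mxm=/opt/mxm CC=gcc CXX=g++
   make all install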

To test the installation, we then compiled latency_checker v0.0.1
     [  https://github.com/sebhtml/latency_checker/tarball/v0.0.1  ] 



Our interconnect is:

     Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev a0)

It is the old a0 revision, not the newer b0 revision.


The problem is that mxm hangs with only 2 processes
(I pressed Control+C to send SIGINT to the processes after 5 minutes):


$ mpiexec -n 2 -tag-output  -mca mtl_mxm_np 0  --mca mtl_base_verbose 999999 \
--mca mtl_mxm_verbose  999999 \
./latency_checker/latency_checker -exchanges 100000 -message-size 4000
[1,0]<stddiag>:[colosse1:12814] mca: base: components_open: Looking for mtl components
[1,1]<stddiag>:[colosse1:12815] mca: base: components_open: Looking for mtl components
[1,0]<stddiag>:[colosse1:12814] mca: base: components_open: opening mtl components
[1,0]<stddiag>:[colosse1:12814] mca: base: components_open: found loaded component mxm
[1,0]<stddiag>:[colosse1:12814] mca: base: components_open: component mxm register function successful
[1,1]<stddiag>:[colosse1:12815] mca: base: components_open: opening mtl components
[1,1]<stddiag>:[colosse1:12815] mca: base: components_open: found loaded component mxm
[1,1]<stddiag>:[colosse1:12815] mca: base: components_open: component mxm register function successful
[1,0]<stddiag>:[colosse1:12814] ../../../../../source/ompi/mca/mtl/mxm/mtl_mxm_component.c:91 - ompi_mtl_mxm_component_open() mxm component open
[1,0]<stddiag>:[colosse1:12814] mca: base: components_open: component mxm open function successful
[1,1]<stddiag>:[colosse1:12815] ../../../../../source/ompi/mca/mtl/mxm/mtl_mxm_component.c:91 - ompi_mtl_mxm_component_open() mxm component open
[1,1]<stddiag>:[colosse1:12815] mca: base: components_open: component mxm open function successful
[1,0]<stddiag>:[colosse1:12814] select: initializing mtl component mxm
[1,0]<stddiag>:[colosse1:12814] ../../../../../source/ompi/mca/mtl/mxm/mtl_mxm.c:194 - ompi_mtl_mxm_module_init() MXM support enabled
[1,0]<stddiag>:[colosse1:12814] ../../../../../source/ompi/mca/mtl/mxm/mtl_mxm.c:141 - ompi_mtl_mxm_create_ep() MXM version is old, consider to upgrade
[1,1]<stddiag>:[colosse1:12815] select: initializing mtl component mxm
[1,1]<stddiag>:[colosse1:12815] ../../../../../source/ompi/mca/mtl/mxm/mtl_mxm.c:194 - ompi_mtl_mxm_module_init() MXM support enabled
[1,1]<stddiag>:[colosse1:12815] ../../../../../source/ompi/mca/mtl/mxm/mtl_mxm.c:141 - ompi_mtl_mxm_create_ep() MXM version is old, consider to upgrade
[1,0]<stddiag>:[colosse1:12814] select: init returned success
[1,0]<stddiag>:[colosse1:12814] select: component mxm selected
[1,1]<stddiag>:[colosse1:12815] select: init returned success
[1,1]<stddiag>:[colosse1:12815] select: component mxm selected
[1,0]<stdout>:Initialized process with rank 0 (size is 2)
[1,0]<stdout>:rank 0 is starting the test
[1,1]<stdout>:Initialized process with rank 1 (size is 2)

mpiexec: killing job...

[1,1]<stderr>:MXM: Got signal 2 (Interrupt)

First, it says 'MXM version is old, consider to upgrade'. But we have the latest MXM version.


It seems that the problem is in MXM since MXM got the signal 2, which 
corresponds to SIGINT according to 'kill -l'.


The thing runs fine with 1 process, but 1 process sending messages to itself
is silly and not scalable.


Here, I presume that mxm uses shared memory because my 2 processes 
are on the same machine. But I don't know for sure.
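
To rule out the intra-node shared-memory path, one option would be to force
the two ranks onto two different nodes, with something like this
(the node names are placeholders):

   mpiexec -n 2 -host node001,node002 -mca mtl_mxm_np 0 \
       ./latency_checker/latency_checker -exchanges 100000 -message-size 4000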

latency_checker works just fine with the Open-MPI BTLs (self, sm, tcp, openib)
under Obi-Wan Kenobi (Open-MPI's PML ob1) on an array of machines.

latency_checker also works just fine with the Open-MPI MTL psm from
Intel/QLogic, under Connor MacLeod (Open-MPI's PML cm).

latency_checker also works fine with MPICH2.
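
For reference, the working Open-MPI configurations above were selected with
roughly the following command lines (treat them as sketches; the exact hosts
and arguments varied):

   # PML ob1 with the usual BTLs
   mpiexec -n 2 --mca pml ob1 --mca btl self,sm,openib \
       ./latency_checker/latency_checker -exchanges 100000 -message-size 4000

   # PML cm with the QLogic PSM MTL
   mpiexec -n 2 --mca pml cm --mca mtl psm \
       ./latency_checker/latency_checker -exchanges 100000 -message-size 4000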

So clearly something is wrong with the way the Open-MPI mxm MTL interacts
with the Open-MPI PML cm, either in the proprietary code from Mellanox or in
the mxm MTL code that ships with Open-MPI.

To narrow down the problem, I ran the same command with a patched
latency_checker that prints additional debug information to standard output.


The log of process with rank=0:

DEBUG MPI_Iprobe source: MPI_ANY_SOURCE tag: MPI_ANY_TAG
DEBUG MPI_Isend destination: 0 tag: MESSAGE_TAG_BEGIN_TEST (5)
DEBUG MPI_Isend destination: 1 tag: MESSAGE_TAG_BEGIN_TEST (5)
DEBUG MPI_Iprobe source: MPI_ANY_SOURCE tag: MPI_ANY_TAG
DEBUG MPI_Recv source: 0 tag: MESSAGE_TAG_BEGIN_TEST (5)
DEBUG MPI_Isend destination: 1 tag: MESSAGE_TAG_TEST_MESSAGE (0)
DEBUG MPI_Iprobe source: MPI_ANY_SOURCE tag: MPI_ANY_TAG
DEBUG MPI_Iprobe source: MPI_ANY_SOURCE tag: MPI_ANY_TAG
DEBUG MPI_Iprobe source: MPI_ANY_SOURCE tag: MPI_ANY_TAG
DEBUG MPI_Iprobe source: MPI_ANY_SOURCE tag: MPI_ANY_TAG
(followed by an endless string of MPI_Iprobe calls)


    => So rank 0 tells ranks 0 and 1 to begin the test (MESSAGE_TAG_BEGIN_TEST)
    => rank 0 receives the message that tells it to begin the test (MESSAGE_TAG_BEGIN_TEST)
    => rank 0 sends a test message to rank 1 (MESSAGE_TAG_TEST_MESSAGE)
    => rank 0 busy-waits for a reply which never comes.


The log of process with rank=1:

DEBUG MPI_Iprobe source: MPI_ANY_SOURCE tag: MPI_ANY_TAG
DEBUG MPI_Iprobe source: MPI_ANY_SOURCE tag: MPI_ANY_TAG
DEBUG MPI_Iprobe source: MPI_ANY_SOURCE tag: MPI_ANY_TAG
DEBUG MPI_Iprobe source: MPI_ANY_SOURCE tag: MPI_ANY_TAG
DEBUG MPI_Iprobe source: MPI_ANY_SOURCE tag: MPI_ANY_TAG
DEBUG MPI_Iprobe source: MPI_ANY_SOURCE tag: MPI_ANY_TAG
DEBUG MPI_Iprobe source: MPI_ANY_SOURCE tag: MPI_ANY_TAG
DEBUG MPI_Iprobe source: MPI_ANY_SOURCE tag: MPI_ANY_TAG
DEBUG MPI_Iprobe source: MPI_ANY_SOURCE tag: MPI_ANY_TAG
(followed by an endless string of MPI_Iprobe calls)


    => It seems that rank 1 is unable to probe anything at all.


The bottom line is that rank 1 never probes any message, although there are
two messages waiting to be received (a stripped-down reproducer is sketched
after the list):

 - one with tag MESSAGE_TAG_BEGIN_TEST from source 0 and 
 - one with tag MESSAGE_TAG_TEST_MESSAGE from source 0.
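
To take latency_checker out of the equation, here is a stripped-down
reproducer of this exact pattern. This is not the latency_checker source,
just a minimal sketch of what the debug logs suggest it does: rank 0 posts a
nonblocking send to rank 1, and rank 1 busy-waits on MPI_Iprobe with
MPI_ANY_SOURCE / MPI_ANY_TAG before receiving.

/* iprobe_hang.c -- minimal sketch of the pattern seen in the logs above
 * (not the actual latency_checker source). Rank 0 posts a nonblocking
 * send; rank 1 busy-waits on MPI_Iprobe with wildcards, then receives. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv) {
    int rank, size;
    char buffer[4000];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        fprintf(stderr, "run with at least 2 processes\n");
        MPI_Finalize();
        return 1;
    }

    if (rank == 0) {
        MPI_Request request;
        memset(buffer, 0, sizeof(buffer));
        /* tag 0 plays the role of MESSAGE_TAG_TEST_MESSAGE */
        MPI_Isend(buffer, 4000, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &request);
        MPI_Wait(&request, MPI_STATUS_IGNORE);
        printf("rank 0: send completed\n");
    } else if (rank == 1) {
        int flag = 0;
        MPI_Status status;
        /* busy-wait, exactly like the endless MPI_Iprobe loop in the log */
        while (!flag) {
            MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                       &flag, &status);
        }
        MPI_Recv(buffer, 4000, MPI_BYTE, status.MPI_SOURCE, status.MPI_TAG,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1: probed and received a message from %d with tag %d\n",
               status.MPI_SOURCE, status.MPI_TAG);
    }

    MPI_Finalize();
    return 0;
}

It can be compiled with mpicc and launched with the same
'mpiexec -n 2 -mca mtl_mxm_np 0' options as above. If the mxm MTL mishandles
wildcard probes, rank 1 should spin forever in the while loop, whereas with
PML ob1 both ranks should finish immediately.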




There is this thread on the Open-MPI mailing list describing a related issue:

[OMPI users] application with mxm hangs on startup
http://www.open-mpi.org/community/lists/users/2012/08/19980.php
 


There are also a few hits about mxm and hangs in the Open-MPI tickets:

   https://svn.open-mpi.org/trac/ompi/search?q=mxm+hang


In particular, this patch will be included in Open-MPI v1.6.2:

   https://svn.open-mpi.org/trac/ompi/ticket/3273


But that ticket is about a hang after unloading the MXM component, so it seems
unrelated to what we see with mxm v1.1 (mxm_1.1.1341) and Open-MPI v1.6.1,
because components are only unloaded once MPI_Finalize is called, I think.


So, any pointers as to where we can go from here regarding MXM?



Cheers, Sébastien

***

Sébastien Boisvert
Ph.D. student in Physiology-endocrinology
Université Laval
http://boisvert.info



