[Users] A simple MPI application hangs with MXM

Sébastien Boisvert Sebastien.Boisvert at clumeq.ca
Thu Nov 1 09:37:40 PDT 2012


Hi Alina,

Regarding the increase in latency that we see when we increase
the number of processes, it really is because our Mellanox MT26428
interconnect cannot provide good service to 8 UNIX processes per node.

I think this is not related to software, at least not to the userland.

With a hybrid programming model ("mini-ranks" implemented with MPI and
IEEE POSIX threads), the latency remains constant.
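
Roughly, a mini-rank is a POSIX thread acting as a virtual rank inside a
single MPI process, so fewer MPI processes per node compete for the HCA.
Below is a minimal sketch of that layout (hypothetical names; this is not
the actual mini-ranks code):

    #include <mpi.h>
    #include <pthread.h>

    #define MINI_RANKS_PER_PROCESS 8  /* hypothetical: 8 threads share one MPI rank */

    /* Each thread behaves like a "mini-rank"; all network traffic goes
     * through the single MPI rank that owns the threads. */
    static void *mini_rank_main(void *arg)
    {
        int mini_rank = *(int *) arg;
        (void) mini_rank;
        /* ... per-thread work and message handling ... */
        return NULL;
    }

    int main(int argc, char **argv)
    {
        pthread_t threads[MINI_RANKS_PER_PROCESS];
        int ids[MINI_RANKS_PER_PROCESS];
        int provided;
        int i;

        /* request thread support so several threads can drive communication */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        for (i = 0; i < MINI_RANKS_PER_PROCESS; i++) {
            ids[i] = i;
            pthread_create(&threads[i], NULL, mini_rank_main, &ids[i]);
        }
        for (i = 0; i < MINI_RANKS_PER_PROCESS; i++)
            pthread_join(threads[i], NULL);

        MPI_Finalize();
        return 0;
    }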

Table 1: Comparison of MPI ranks with mini-ranks on the Colosse
supercomputer at Laval University.
+-------+---------------------------------------------------+
| Cores | Average round-trip latency (us)                   |
+-------+-----------------------+---------------------------+
|       | MPI ranks             | mini-ranks                |
|       | (pure MPI)            | (MPI + pthread)           |
+-------+-----------------------+---------------------------+
| 8     | 11.25 +/- 0           | 24.1429 +/- 0             |
| 16    | 35.875 +/- 6.92369    | 43.0179 +/- 8.76275       |
| 32    | 66.3125 +/- 6.76387   | 41.7143 +/- 1.23924       |
| 64    | 90 +/- 16.5265        | 37.75 +/- 6.41984         |
| 128   | 126.562 +/- 25.0116   | 43.0179 +/- 8.76275       |
| 256   | 203.637 +/- 67.4579   | 44.6429 +/- 6.11862       |
+-------+-----------------------+---------------------------+

If you would like more information about how I produced Table 1, I will
be more than happy to provide it.
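
For reference, the latency in Table 1 is a round-trip (ping-pong style)
average. A minimal two-rank sketch of that kind of measurement (simplified,
not the actual latency_checker code; the message count and size are borrowed
from the latency_checker invocation quoted below):

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        int rank;
        int exchanges = 100000;   /* matches -exchanges 100000 */
        int message_size = 4000;  /* matches -message-size 4000 */
        char buffer[4000];
        double start, elapsed;
        int i;

        memset(buffer, 0, sizeof(buffer));
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        start = MPI_Wtime();
        for (i = 0; i < exchanges; i++) {
            if (rank == 0) {
                /* send a message and wait for the echo */
                MPI_Send(buffer, message_size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buffer, message_size, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                /* echo every message back to rank 0 */
                MPI_Recv(buffer, message_size, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buffer, message_size, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        elapsed = MPI_Wtime() - start;

        if (rank == 0)
            printf("average round-trip latency: %.3f us\n",
                   elapsed / exchanges * 1e6);

        MPI_Finalize();
        return 0;
    }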

Maybe it is the hardware revision (rev. a0 according to lspci) that is dodgy.
I don't know for sure.

0d:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev a0)


What do you think?


               Sébastien

---
Sent from my IBM Blue Gene/Q

On 10/09/2012 09:32 AM, Alina Sklarevich wrote:
> Hello Sébastien,
>
> We are looking into this issue and will update you soon.
>
> Thanks,
> Alina.
>
> -----Original Message-----
> From: Sébastien Boisvert [mailto:Sebastien.Boisvert at clumeq.ca]
> Sent: Tuesday, September 25, 2012 11:07 PM
> To: Mike Dubman
> Cc: OpenFabrics Software users; Todd Wilde; Florent Parent; Jean-Francois Le Fillatre; Susan Coulter; sebastien.boisvert.3 at ulaval.ca; Aleksey Senin; Alina Sklarevich
> Subject: Re: A simple MPI application hangs with MXM
>
> Hello,
>
>
> Several changes are related to Mellanox MXM in
>    http://svn.open-mpi.org/svn/ompi/branches/v1.6/NEWS:
>
> 1.6.2
> -----
>
> - Fix issue with MX MTL.  Thanks to Doug Eadline for raising the issue.
> - Fix singleton MPI_COMM_SPAWN when the result job spans multiple nodes.
> - Fix MXM hang, and update for latest version of MXM.
> - Update to support Mellanox FCA 2.5.
> - Fix startup hang for large jobs.
> - Ensure MPI_TESTANY / MPI_WAITANY properly set the empty status when
>    count==0.
> - Fix MPI_CART_SUB behavior of not copying periods to the new
>    communicator properly.  Thanks to John Craske for the bug report.
> - Add btl_openib_abort_not_enough_reg_mem MCA parameter to cause Open
>    MPI to abort MPI jobs if there is not enough registered memory
>    available on the system (vs. just printing a warning).  Thanks to
>    Brock Palen for raising the issue.
> - Minor fix to Fortran MPI_INFO_GET: only copy a value back to the
>    user's buffer if the flag is .TRUE.
> - Fix VampirTrace compilation issue with the PGI compiler suite.
>
>
>
>
> With Open-MPI 1.6.2, which was released today, the simple code still hangs.
>
> $ type mpiexec
> mpiexec is hashed (/rap/clumeq/Seb-Boisvert/Open-MPI-1.6.2+MXM/build-mxm/bin/mpiexec)
>
> $ mpiexec --mca mtl_mxm_np 0 -n 2 ./latency_checker
> Initialized process with rank 0 (size is 2)
> Initialized process with rank 1 (size is 2)
> rank 0 is starting the test
> MXM: Got signal 2 (Interrupt)
> MXM: Got signal 2 (Interrupt)
> mpiexec: killing job...
>
>
> The code runs just fine with Open-MPI BTLs:
>
> $ mpiexec -n 2 ./latency_checker
>
> (runs OK)
>
>
> As I said in the other email, it seems that MPI_Iprobe never sees any incoming
> messages when using MXM (the output flag remains 0). Hence the hang.
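>
> On the receive side, latency_checker essentially polls in a loop like the
> sketch below (a simplified illustration, not the exact latency_checker code;
> buffer and MAX_MESSAGE_SIZE are placeholders). With MXM, the flag never
> becomes 1, so the loop spins forever:
>
>     int flag = 0;
>     MPI_Status status;
>
>     while (1) {
>         /* ask MPI whether any message is waiting, from any source, with any tag */
>         MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
>
>         if (flag) {
>             /* a message is pending: receive it using the probed source and tag */
>             MPI_Recv(buffer, MAX_MESSAGE_SIZE, MPI_BYTE, status.MPI_SOURCE,
>                      status.MPI_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>             /* ... handle the message, possibly send a reply ... */
>             flag = 0;
>         }
>
>         /* ... do other work, then poll again ... */
>     }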
>
>
> Let me know if there is something I can do to help.
>
>
> Sébastien
>
>
> On 21/09/12 11:58 AM, Mike Dubman wrote:
>> Hi Sébastien,
>> Thank you for your report. Could you please check that the latest mxm is indeed installed on all nodes (rpm -qi mxm)?
>> Also, please make sure that the mpiexec you use refers to the newly compiled ompi (please run with a full path and add "--display-map").
>>
>> Aleksey - please follow this report.
>>
>> Thanks
>> M
>>
>>> -----Original Message-----
>>> From: Sébastien Boisvert [mailto:Sebastien.Boisvert at clumeq.ca]
>>> Sent: 21 September, 2012 18:26
>>> To: OpenFabrics Software users; Mike Dubman; Todd Wilde; Florent Parent;
>>> Jean-Francois Le Fillatre; Susan Coulter; sebastien.boisvert.3 at ulaval.ca
>>> Subject: A simple MPI application hangs with MXM
>>>
>>> Hello everyone,
>>>
>>> We have an ofed cluster with a Mellanox interconnect.
>>>
>>> Mellanox Messaging Accelerator (MXM) is a Mellanox product and Open-MPI
>>> supports it too [ http://www.open-mpi.org//faq/?category=openfabrics#mxm ]
>>>
>>> It includes
>>>
>>>     "enhancements that significantly increase the scalability and performance
>>>     of message communications in the network, alleviating bottlenecks within
>>>     the parallel communication libraries."
>>>
>>>
>>> Anyway, at our supercomputer center we installed mxm v1.1 (mxm_1.1.1341)
>>>      [ http://mellanox.com/downloads/hpc/mxm/v1.1/mxm-latest.tar ] and
>>> Open-MPI v1.6.1 [ http://www.open-mpi.org/software/ompi/v1.6/downloads/openmpi-1.6.1.tar.bz2 ].
>>> We compiled all of this with gcc 4.6.1.
>>>
>>> To test the installation, we then compiled latency_checker v0.0.1
>>>       [  https://github.com/sebhtml/latency_checker/tarball/v0.0.1  ]
>>>
>>>
>>>
>>> Our interconnect is:
>>>
>>>       Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev a0)
>>>
>>> It is the old a0 revision, not the newer b0 revision.
>>>
>>>
>>> The problem is that mxm hangs with only 2 processes (I pressed CONTROL + C
>>> to send SIGINT to the processes after 5 minutes):
>>>
>>>
>>> $ mpiexec -n 2 -tag-output -mca mtl_mxm_np 0 --mca mtl_base_verbose 999999 \
>>> --mca mtl_mxm_verbose 999999 \
>>> ./latency_checker/latency_checker -exchanges 100000 -message-size 4000
>>> [1,0]<stddiag>:[colosse1:12814] mca: base: components_open: Looking for mtl components
>>> [1,1]<stddiag>:[colosse1:12815] mca: base: components_open: Looking for mtl components
>>> [1,0]<stddiag>:[colosse1:12814] mca: base: components_open: opening mtl components
>>> [1,0]<stddiag>:[colosse1:12814] mca: base: components_open: found loaded component mxm
>>> [1,0]<stddiag>:[colosse1:12814] mca: base: components_open: component mxm register function successful
>>> [1,1]<stddiag>:[colosse1:12815] mca: base: components_open: opening mtl components
>>> [1,1]<stddiag>:[colosse1:12815] mca: base: components_open: found loaded component mxm
>>> [1,1]<stddiag>:[colosse1:12815] mca: base: components_open: component mxm register function successful
>>> [1,0]<stddiag>:[colosse1:12814] ../../../../../source/ompi/mca/mtl/mxm/mtl_mxm_component.c:91 - ompi_mtl_mxm_component_open() mxm component open
>>> [1,0]<stddiag>:[colosse1:12814] mca: base: components_open: component mxm open function successful
>>> [1,1]<stddiag>:[colosse1:12815] ../../../../../source/ompi/mca/mtl/mxm/mtl_mxm_component.c:91 - ompi_mtl_mxm_component_open() mxm component open
>>> [1,1]<stddiag>:[colosse1:12815] mca: base: components_open: component mxm open function successful
>>> [1,0]<stddiag>:[colosse1:12814] select: initializing mtl component mxm
>>> [1,0]<stddiag>:[colosse1:12814] ../../../../../source/ompi/mca/mtl/mxm/mtl_mxm.c:194 - ompi_mtl_mxm_module_init() MXM support enabled
>>> [1,0]<stddiag>:[colosse1:12814] ../../../../../source/ompi/mca/mtl/mxm/mtl_mxm.c:141 - ompi_mtl_mxm_create_ep() MXM version is old, consider to upgrade
>>> [1,1]<stddiag>:[colosse1:12815] select: initializing mtl component mxm
>>> [1,1]<stddiag>:[colosse1:12815] ../../../../../source/ompi/mca/mtl/mxm/mtl_mxm.c:194 - ompi_mtl_mxm_module_init() MXM support enabled
>>> [1,1]<stddiag>:[colosse1:12815] ../../../../../source/ompi/mca/mtl/mxm/mtl_mxm.c:141 - ompi_mtl_mxm_create_ep() MXM version is old, consider to upgrade
>>> [1,0]<stddiag>:[colosse1:12814] select: init returned success
>>> [1,0]<stddiag>:[colosse1:12814] select: component mxm selected
>>> [1,1]<stddiag>:[colosse1:12815] select: init returned success
>>> [1,1]<stddiag>:[colosse1:12815] select: component mxm selected
>>> [1,0]<stdout>:Initialized process with rank 0 (size is 2)
>>> [1,0]<stdout>:rank 0 is starting the test
>>> [1,1]<stdout>:Initialized process with rank 1 (size is 2)
>>>
>>> mpiexec: killing job...
>>>
>>> [1,1]<stderr>:MXM: Got signal 2 (Interrupt)
>>>
>>> First, it says 'MXM version is old, consider to upgrade'. But we have the latest
>>> MXM version.
>>>
>>>
>>> It seems that the problem is in MXM since MXM got the signal 2, which
>>> corresponds to SIGINT according to 'kill -l'.
>>>
>>>
>>> The thing runs fine with 1 process, but 1 process sending messages to itself
>>> is silly and not scalable.
>>>
>>>
>>> Here, I presume that mxm uses shared memory because my 2 processes
>>> are on the same machine. But I don't know for sure.
>>>
>>> latency_checker works just fine with Open-MPI BTLs (self, sm, tcp, openib)
>>> implemented with Obi-Wan Kenobi (Open-MPI's PML ob1) on an array of
>>> machines.
>>>
>>> latency_checker also works just fine with the Open-MPI MTL psm from
>>> Intel/QLogic, implemented with Connor MacLeod (Open-MPI PML cm).
>>>
>>> latency_checker also works fine with MPICH2.
>>>
>>> So clearly something is wrong with the way the Open-MPI MTL called mxm
>>> interacts with the Open-MPI PML cm, either in the proprietary code from
>>> Mellanox or in the mxm code that ships with Open-MPI.
>>>
>>> To figure out the problem further, I ran the same command with a patched
>>> latency_checker that prints additional debugging information to standard output.
>>>
>>>
>>> The log of process with rank=0:
>>>
>>> DEBUG MPI_Iprobe source: MPI_ANY_SOURCE tag: MPI_ANY_TAG
>>> DEBUG MPI_Isend destination: 0 tag: MESSAGE_TAG_BEGIN_TEST (5)
>>> DEBUG MPI_Isend destination: 1 tag: MESSAGE_TAG_BEGIN_TEST (5)
>>> DEBUG MPI_Iprobe source: MPI_ANY_SOURCE tag: MPI_ANY_TAG
>>> DEBUG MPI_Recv source: 0 tag: MESSAGE_TAG_BEGIN_TEST (5)
>>> DEBUG MPI_Isend destination: 1 tag: MESSAGE_TAG_TEST_MESSAGE (0)
>>> DEBUG MPI_Iprobe source: MPI_ANY_SOURCE tag: MPI_ANY_TAG
>>> DEBUG MPI_Iprobe source: MPI_ANY_SOURCE tag: MPI_ANY_TAG
>>> DEBUG MPI_Iprobe source: MPI_ANY_SOURCE tag: MPI_ANY_TAG
>>> DEBUG MPI_Iprobe source: MPI_ANY_SOURCE tag: MPI_ANY_TAG
>>> (followed by an endless string of MPI_Iprobe calls)
>>>
>>>
>>>      => rank 0 tells ranks 0 and 1 to begin the test (MESSAGE_TAG_BEGIN_TEST)
>>>      => rank 0 receives the message that tells it to begin the test (MESSAGE_TAG_BEGIN_TEST)
>>>      => rank 0 sends a test message to rank 1 (MESSAGE_TAG_TEST_MESSAGE)
>>>      => rank 0 busy-waits for a reply which never comes.
>>>
>>>
>>> The log of process with rank=1
>>>
>>> DEBUG MPI_Iprobe source: MPI_ANY_SOURCE tag: MPI_ANY_TAG
>>> DEBUG MPI_Iprobe source: MPI_ANY_SOURCE tag: MPI_ANY_TAG
>>> DEBUG MPI_Iprobe source: MPI_ANY_SOURCE tag: MPI_ANY_TAG
>>> DEBUG MPI_Iprobe source: MPI_ANY_SOURCE tag: MPI_ANY_TAG
>>> DEBUG MPI_Iprobe source: MPI_ANY_SOURCE tag: MPI_ANY_TAG
>>> DEBUG MPI_Iprobe source: MPI_ANY_SOURCE tag: MPI_ANY_TAG
>>> DEBUG MPI_Iprobe source: MPI_ANY_SOURCE tag: MPI_ANY_TAG
>>> DEBUG MPI_Iprobe source: MPI_ANY_SOURCE tag: MPI_ANY_TAG
>>> DEBUG MPI_Iprobe source: MPI_ANY_SOURCE tag: MPI_ANY_TAG
>>> (followed by an endless string of MPI_Iprobe calls)
>>>
>>>
>>>      => It seems that rank 1 is unable to probe anything at all.
>>>
>>>
>>> The bottom line is that rank 1 does not probe any message even though there
>>> are two messages waiting to be received:
>>>
>>>   - one with tag MESSAGE_TAG_BEGIN_TEST from source 0 and
>>>   - one with tag MESSAGE_TAG_TEST_MESSAGE from source 0.
>>>
>>>
>>>
>>>
>>> There is this thread on the Open-MPI mailing list describing a related issue:
>>>
>>> [OMPI users] application with mxm hangs on startup
>>> http://www.open-mpi.org/community/lists/users/2012/08/19980.php
>>>
>>>
>>>
>>> There are also a few hits about mxm and hangs in the Open-MPI tickets:
>>>
>>>     https://svn.open-mpi.org/trac/ompi/search?q=mxm+hang
>>>
>>>
>>> Namely, this patch will be included in Open-MPI v1.6.2:
>>>
>>>     https://svn.open-mpi.org/trac/ompi/ticket/3273
>>>
>>>
>>> But that ticket is about a hang after unloading the MXM component, so it is
>>> unrelated to what we see with mxm v1.1 (mxm_1.1.1341) and Open-MPI v1.6.1,
>>> because components are only unloaded once MPI_Finalize is called, I think.
>>>
>>>
>>> So, any pointers as to where we can go from here regarding MXM?
>>>
>>>
>>>
>>> Cheers, Sébastien
>>>
>>> ***
>>>
>>> Sébastien Boisvert
>>> Ph.D. student in Physiology-endocrinology
>>> Université Laval
>>> http://boisvert.info
>>
>



