<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:"Arial Black";
panose-1:2 11 10 4 2 1 2 2 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:#0563C1;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:#954F72;
text-decoration:underline;}
p.msonormal0, li.msonormal0, div.msonormal0
{mso-style-name:msonormal;
margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
span.EmailStyle20
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-US" link="#0563C1" vlink="#954F72">
<div class="WordSection1">
<p class="MsoNormal">Hi Eric,<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Which version of Intel MPI 2019 are you using? Can you try running with FI_LOG_LEVEL=warn and share the logs?<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Thanks,<o:p></o:p></p>
<p class="MsoNormal">Arun.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div style="border:none;border-left:solid blue 1.5pt;padding:0in 0in 0in 4.0pt">
<div>
<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b>From:</b> Libfabric-users <libfabric-users-bounces@lists.openfabrics.org>
<b>On Behalf Of </b>Walter, Eric J<br>
<b>Sent:</b> Tuesday, August 06, 2019 10:48 AM<br>
<b>To:</b> libfabric-users@lists.openfabrics.org<br>
<b>Subject:</b> [libfabric-users] libfabric/intel mpi with mlx5 and > 300 cores/ranks<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black"><o:p> </o:p></span></p>
</div>
<div id="Signature">
<div id="divtagdefaultwrapper">
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">Hi, we are currently standing up a new cluster with Mellanox ConnectX-5 adapters. I have found that using openMPI, mvapich2, and intel2018-mpi, we can run MPI jobs on all 960 cores in the cluster,
however, using intel2019-mpi we can't get beyond ~300 mpi ranks. If we do, we get the following error for every rank:
<o:p></o:p></span></p>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">Abort(273768207) on node 650 (rank 650 in comm 0): Fatal error in PMPI_Comm_split: Other MPI error, error stack:
<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=0, key=650, new_comm=0x7911e8) failed
<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">PMPI_Comm_split(489)...................:
<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">MPIR_Comm_split_impl(167)..............:
<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">MPIR_Allgather_intra_auto(145).........: Failure during collective
<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">MPIR_Allgather_intra_auto(141).........:
<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">MPIR_Allgather_intra_brucks(115).......:
<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">MPIC_Sendrecv(344).....................:
<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">MPID_Isend(662)........................:
<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">MPID_isend_unsafe(282).................:
<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">MPIDI_OFI_send_lightweight_request(106):
<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">(unknown)(): Other MPI error
<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">----------------------------------------------------------------------------------------------------------
<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">This is using the default FI_PROVIDER of ofi_rxm. If we switch to using "verbs", we can run all 960 cores, but tests show an order of magnitude increase in latency and much longer run times.
<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">We have tried installing our own libfabrics (from the git repo ; also we verified with verbose debugging that we are using this libfabrics) and this behavoir does not change<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">Is there anything I can change to allow all 960 cores using the default ofi_rxm provider? Or, is there a way to improve performance using the verbs provider?<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">For completeness:
<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">Using MLNX_OFED_LINUX-4.6-1.0.1.1-rhel7.6-x86_64 ofed
<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">CentOS 7.6.1810 (kernel = 3.10.0-957.21.3.el7.x86_64)
<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">Intel Parallel studio version 19.0.4.243
<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">Thanks! <o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black"><o:p> </o:p></span></p>
</div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">Eric<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:9.0pt;font-family:"Arial Black",sans-serif;color:black">-- </span><span style="font-size:12.0pt;color:black"><o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"Arial Black",sans-serif;color:black">Eric J. Walter</span><span style="font-size:12.0pt;color:black"><o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:9.0pt;font-family:"Arial Black",sans-serif;color:black">College of William and Mary </span><span style="font-size:12.0pt;color:black"><o:p></o:p></span></p>
</div>
<div>
<div>
<div>
<p class="MsoNormal"><span style="font-size:9.0pt;font-family:"Arial Black",sans-serif;color:black">IT/High Performance Computing Group </span><span style="font-size:12.0pt;color:black"><o:p></o:p></span></p>
</div>
</div>
<div>
<p class="MsoNormal"><span style="font-size:9.0pt;font-family:"Arial Black",sans-serif;color:black">ISC 1271 </span><span style="font-size:12.0pt;color:black"><o:p></o:p></span></p>
</div>
</div>
<div>
<p class="MsoNormal"><span style="font-size:9.0pt;font-family:"Arial Black",sans-serif;color:black">P.O. Box 8795</span><span style="font-size:12.0pt;color:black"><o:p></o:p></span></p>
</div>
<div>
<div>
<p class="MsoNormal"><span style="font-size:9.0pt;font-family:"Arial Black",sans-serif;color:black">Williamsburg, VA 23187-8795</span><span style="font-size:12.0pt;color:black"><o:p></o:p></span></p>
</div>
</div>
<div>
<p class="MsoNormal"><span style="font-size:10.5pt;font-family:"Arial Black",sans-serif;color:black">email: <a href="mailto:ejwalt@wm.edu">ejwalt@wm.edu</a></span><span style="font-size:12.0pt;color:black"><o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:9.0pt;font-family:"Arial Black",sans-serif;color:black">phone: (757) 221-1886</span><span style="font-size:12.0pt;color:black"><o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:9.0pt;font-family:"Arial Black",sans-serif;color:black">fax: (757) 221-1321</span><span style="font-size:12.0pt;color:black"><o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black"><o:p> </o:p></span></p>
</div>
</div>
</div>
</div>
</div>
</body>
</html>