<div dir="ltr">Hi John,<div><br></div><div>The control messages that the GNI provider uses internally (GNI SMSG messages) use a circular queue</div><div>mechanism with credits to protect the queue from overrun. Also, for the RX hardware</div><div>CQs - which can get overrun if you are sending many small messages to a single endpoint - there's a plan</div><div>B fallback that goes through all of the mailboxes (the circular queues) and looks for new messages until the</div><div>RX CQ is no longer in an overflow state. You can think of the RX CQ as an optimization for notifying us which</div><div>mailbox to check for new messages. This procedure has been used in Cray MPI since the first Cray XEs came out, so it has had a lot of testing over the years. </div><div><br></div><div>Small messages injected via fi_tsend and friends use the SMSG protocol, while larger messages go</div><div>through a combination of SMSG control messages and RDMA reads. </div><div><br></div><div>Hope this helps some,</div><div><br></div><div>Howard</div><div><br></div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">2017-04-14 6:54 GMT-06:00 Biddiscombe, John A. <span dir="ltr">&lt;<a href="mailto:biddisco@cscs.ch" target="_blank">biddisco@cscs.ch</a>&gt;</span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div lang="EN-GB" link="blue" vlink="purple">
<div class="m_7629548046710581567WordSection1">
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">Howard<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><u></u> <u></u></span></p>
<div><span class="">
<div>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">></span><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">This looks like heap corruption somehow. Could you try rebuilding libfabric<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">with --enable-debug and set FI_LOG_LEVEL to warn and see if that gives<u></u><u></u></p>
</div>
</span><div>
<p class="MsoNormal">more info?<u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">&lt;<u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">After making some changes to our code, the problem seems to have gone away. We limit the number of messages that a node can send at a time to a smaller number
and things seem to behave better. <u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">A more general question though:<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">Does libfabric have any flow control mechanism built in? If I send a large number of messages from many nodes to one single node, what behaviour can I expect from libfabric once the preposted receives
are exhausted? Will messages be resent, or will the network layer transition into an error state from which it is difficult to recover?<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">Experiments indicate that libfabric is handling large numbers of messages without returning errors, but I’m curious to know what knobs/controls exist to allow
us to adjust behaviour and manage flow control.<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">Thanks<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">JB<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><u></u> <u></u></span></p>
</div>
</div>
</div>
</div>
</blockquote></div><br></div>