<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" style="display:none;"><!-- P {margin-top:0;margin-bottom:0;} --></style>
</head>
<body dir="ltr">
<div id="divtagdefaultwrapper" style="font-size:12pt;color:#000000;font-family:Calibri,Helvetica,sans-serif;" dir="ltr">
<p>As is usual in these cases, I found a nasty bug in my code right after I posted my message to the list. It turns out that I was sending messages that were bigger than I thought: due to the way memory was allocated, buffers were rounded up to the next
power of 2, so a 100,000 byte message was actually 131,072 bytes, a tidy 30% larger than expected, and this accounts for the 30% bandwidth difference on large messages. On smaller messages the difference was not such a big deal and was masked by latencies, but for the
larger sizes it hammered my benchmark numbers.</p>
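<p>For the curious, the effect is just the usual round-up-to-the-next-power-of-two; a small standalone snippet (not the allocator itself, just an illustration) shows the ~31% overhead for a 100,000 byte request:</p>
<pre>
#include &lt;cstdint&gt;
#include &lt;cstdio&gt;

// Round n up to the next power of two (64-bit), as the allocator
// described above effectively does when sizing message buffers.
static std::uint64_t round_up_pow2(std::uint64_t n) {
    if (n &lt;= 1) return 1;
    --n;
    n |= n &gt;&gt; 1;  n |= n &gt;&gt; 2;  n |= n &gt;&gt; 4;
    n |= n &gt;&gt; 8;  n |= n &gt;&gt; 16; n |= n &gt;&gt; 32;
    return n + 1;
}

int main() {
    std::uint64_t requested = 100000;
    std::uint64_t allocated = round_up_pow2(requested);   // 131072 = 2^17
    std::printf("requested %llu, allocated %llu, overhead %.1f%%\n",
                (unsigned long long)requested, (unsigned long long)allocated,
                100.0 * (double)(allocated - requested) / (double)requested); // ~31.1%
    return 0;
}
</pre>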
<p><br>
</p>
<p>Apologies for the noise.</p>
<p><br>
</p>
<p>JB</p>
<p><br>
</p>
</div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Libfabric-users <libfabric-users-bounces@lists.openfabrics.org> on behalf of Biddiscombe, John A. <john.biddiscombe@cscs.ch><br>
<b>Sent:</b> 09 June 2022 14:48:43<br>
<b>To:</b> libfabric-users@lists.openfabrics.org<br>
<b>Subject:</b> [libfabric-users] Suggestions needed for improved performance</font>
<div> </div>
</div>
<div>
<div id="divtagdefaultwrapper" style="font-size:12pt;color:#000000;font-family:Calibri,Helvetica,sans-serif;" dir="ltr">
<p>Dear list,</p>
<p><br>
</p>
<p>I'm looking for suggestions on things to try. One of our benchmarks that uses libfabric performs well enough with small messages. The benchmark is written in such a way that we can swap the back-end between a native MPI implementation and a libfabric implementation
and compare performance. The test uses tagged sends and receives between two nodes and simply does lots of them, with a certain number of messages allowed to be 'in flight' per thread at any moment.</p>
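<p>Roughly speaking, each sender thread follows the pattern sketched below (a simplified illustration rather than our actual code; error handling and the wrapper classes are omitted, and 'window' stands for the per-thread in-flight limit):</p>
<pre>
#include &lt;rdma/fabric.h&gt;
#include &lt;rdma/fi_endpoint.h&gt;
#include &lt;rdma/fi_eq.h&gt;
#include &lt;rdma/fi_errno.h&gt;
#include &lt;rdma/fi_tagged.h&gt;

// Keep up to 'window' tagged sends outstanding per thread, reaping
// completions from the TX completion queue as they arrive.
// Assumes the CQ was opened with FI_CQ_FORMAT_TAGGED.
void send_loop(struct fid_ep *ep, struct fid_cq *txcq,
               void *buf, size_t len, void *desc,
               fi_addr_t dst, uint64_t tag,
               size_t window, size_t total_msgs)
{
    size_t posted = 0, completed = 0;
    struct fi_cq_tagged_entry comp[16];

    while (completed &lt; total_msgs) {
        // Top up the in-flight window.
        while (posted &lt; total_msgs &amp;&amp; posted - completed &lt; window) {
            ssize_t rc = fi_tsend(ep, buf, len, desc, dst, tag, nullptr);
            if (rc == -FI_EAGAIN)
                break;              // send queue full, go poll instead
            // (real code would handle other error codes here)
            ++posted;
        }
        // Poll for send completions.
        ssize_t n = fi_cq_read(txcq, comp, 16);
        if (n &gt; 0)
            completed += (size_t)n;
        // n == -FI_EAGAIN simply means nothing has completed yet.
    }
}
</pre>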
<p><br>
</p>
<p>On Piz Daint, the Cray machine at CSCS:</p>
<p><br>
</p>
<p>8 threads, 10 messages per thread in flight at any time:</p>
<table border="1" cellpadding="4" cellspacing="0">
<tr><th>message size</th><th>libfabric</th><th>MPI</th><th>speedup</th></tr>
<tr><td>1 byte</td><td>0.80 MB/s</td><td>0.38 MB/s</td><td>2x</td></tr>
<tr><td>100 bytes</td><td>85 MB/s</td><td>37 MB/s</td><td>2x</td></tr>
<tr><td>10,000 bytes</td><td>3600 MB/s</td><td>2000 MB/s</td><td>1.5x</td></tr>
<tr><td>100,000 bytes</td><td>10,800 MB/s</td><td>13,900 MB/s</td><td>0.8x</td></tr>
</table>
<p><br>
</p>
<p>At the largest message size we are now lagging well behind MPI, which is reaching approximately the bandwidth of the system (as expected, and similar to the OSU benchmark).</p>
<p><br>
</p>
<p>The benchmark uses message buffer objects with a custom allocator; all memory from this allocator is pinned using
<span>fi_mr_reg</span> (we use FI_MR_BASIC mode), so there is no pinning of memory during the benchmark run: everything is pinned in advance when the memory buffers are created at startup. The messages are sent with a tagged send and each buffer has its memory
descriptor supplied:</p>
<pre>
// fi_tsend(ep, buf, len, desc, dest_addr, tag, context)
execute_fi_function(fi_tsend, "fi_tsend",
    m_tx_endpoint.get_ep(), send_region.get_address(), send_region.get_size(),
    send_region.get_local_key(), dst_addr_, tag_, ctxt);
</pre>
<p>So the question is: what could be going wrong for the libfabric backend that causes such a significant drop in relative performance with larger messages? I've experimented with different threading modes (FI_THREAD_SAFE and friends) and with removing/adding locks around the injection
and polling code, but since we perform well with small messages I do not think there is anything wrong with the basic framework around the send/recv and polling functions. It would appear to be a message size issue. Is libfabric assuming that the buffers
are not pinned and wasting time trying to pin them again?</p>
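<p>For reference, the up-front registration boils down to something like the sketch below (a minimal illustration, not our allocator code; the function and variable names are invented). In libfabric the local descriptor passed as the 'desc' argument of fi_tsend comes from fi_mr_desc, while fi_mr_key gives the key used for remote access.</p>
<pre>
#include &lt;rdma/fabric.h&gt;
#include &lt;rdma/fi_domain.h&gt;

// Register one buffer up front so that no pinning happens on the data path.
// 'domain' is the fid_domain the endpoint was created from.
void *register_buffer(struct fid_domain *domain, void *buf, size_t len,
                      struct fid_mr **mr_out)
{
    struct fid_mr *mr = nullptr;
    int rc = fi_mr_reg(domain, buf, len,
                       FI_SEND | FI_RECV,            // access flags for tagged send/recv
                       0 /*offset*/, 0 /*requested_key*/, 0 /*flags*/,
                       &amp;mr, nullptr /*context*/);
    if (rc != 0)
        return nullptr;                              // real code would report fi_strerror(-rc)
    *mr_out = mr;
    // fi_mr_desc() gives the local descriptor used as the 'desc' argument of
    // fi_tsend/fi_trecv; fi_mr_key() would give the key for remote access.
    return fi_mr_desc(mr);
}
</pre>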
<div><br>
</div>
<div>One caveat: the benchmark uses MPI to initialize, so the libfabric tests coexist with MPI in the same executable (and use the GNI backend). I was running tests on LUMI (verbs backend) and saw similar speed drops (though on LUMI the MPI uses the
libfabric backend too), but I cannot access that machine until maintenance is over.</div>
<div>On Daint I launch with <span>MPICH_GNI_NDREG_ENTRIES=1024</span>, set the memory registration mode to udreg and lazy deregistration to true (not that GNI should be registering much, since we have done it already).<br>
</div>
<div><br>
</div>
<div>I welcome any suggestions as to what MPI might be doing better, or what we might be doing wrong. (I tried profiling and saw no obvious hotspots in our code; the major time hog was polling the receive queues.)</div>
<div><br>
</div>
<div>Many thanks</div>
<div><br>
</div>
<div>JB<br>
</div>
<div><br>
</div>
</div>
</div>
</body>
</html>