<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Oct 19, 2016 at 3:57 AM, Peter Boyle <span dir="ltr">&lt;<a href="mailto:paboyle@ph.ed.ac.uk" target="_blank">paboyle@ph.ed.ac.uk</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>

Hi,<br>

<br>

I’m interested in trying to minimally modify a scientific library<br>

<br>

        <a href="http://www.github.com/paboyle/Grid/" rel="noreferrer" target="_blank">www.github.com/paboyle/Grid/</a><br>

<br>

to tackle obtaining good dual rail OPA performance on KNL<br>

from a single process per node.<br>

<br></blockquote><div><br></div><div>I will respond to you privately about platform-specific considerations.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

The code is naturally hybrid OpenMP + MPI and 1 process per node minimises internal<br>

communication in the node so would be the preferred use.<br>

<br>

The application currently has (compile time selection) SHMEM and MPI transport layers, but it appears<br>

that the MPI versions we have tried (OpenMPI, MVAPICH2, Intel MPI) have a big lock and no<br>

real multi-core concurrency from a single process.<br>

<br></blockquote><div><br></div><div>Indeed, fat locks are the status quo.  You probably know that Blue Gene/Q supported fine-grained locking, and Cray MPI has something similar for KNL (<a href="https://cug.org/proceedings/cug2016_proceedings/includes/files/pap140-file2.pdf">https://cug.org/proceedings/cug2016_proceedings/includes/files/pap140-file2.pdf</a>, <a href="https://cug.org/proceedings/cug2016_proceedings/includes/files/pap140.pdf">https://cug.org/proceedings/cug2016_proceedings/includes/files/pap140.pdf</a>), but the Cray MPI implementation does not use OFI (it uses uGNI and DMAPP instead).</div><div><br></div><div>Unfortunately, even fine-grained locking isn't sufficient to get ideal concurrency, because of MPI semantics.  MPI-4 may be able to help here, but it will be a while before that standard exists, and likely longer before endpoints and/or communicator info assertions are implemented well enough to support ideal concurrency on networks that are capable of it.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

Is there any wisdom about either<br>

<br>

i) should SHMEM suffice to gain concurrency from a single multithreaded process, in a way that MPI does not?<br>

<br></blockquote><div><br></div><div>SHMEM is not burdened with the semantics of MPI send-recv, but OpenSHMEM currently does not support threads (we are in the process of trying to define that support).</div><div><br></div><div>As far as I know, SHMEM for OFI (<a href="https://github.com/Sandia-OpenSHMEM/SOS/">https://github.com/Sandia-OpenSHMEM/SOS/</a>) is not going to give you thread-oriented concurrency.  If I'm wrong, someone on this list will correct me.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

ii) would it be better to drop to OFI (using the useful SHMEM tutorial on GitHub as an example)<br>

<br></blockquote><div><br></div><div>That is the right path forward, modulo some platform-specific details I'll share out-of-band.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

iii) Can job invocation rely on mpirun for job load, and use MPI (or OpenSHMEM) for exchange of network addresses, but then<br>

<br>

     a) safely fire up OFI endpoints and use them instead of MPI<br>

<br>

     b) safely fire up OFI endpoints and use them alongside MPI?<br>

<br>

Advice appreciated.<br>

<span class="gmail-HOEnZb"><font color="#888888"><br></font></span></blockquote><div><br></div><div>You might find that the proposed OpenSHMEM extension for counting puts (<a href="http://www.csm.ornl.gov/workshops/openshmem2013/documents/presentations_and_tutorials/Dinan_Reducing_Synchronization_Overhead_Through_Bundled_Communication.pdf">http://www.csm.ornl.gov/workshops/openshmem2013/documents/presentations_and_tutorials/Dinan_Reducing_Synchronization_Overhead_Through_Bundled_Communication.pdf</a>, <a href="http://rd.springer.com/chapter/10.1007/978-3-319-05215-1_12">http://rd.springer.com/chapter/10.1007/978-3-319-05215-1_12</a>) that is implemented in <a href="https://github.com/Sandia-OpenSHMEM/SOS">https://github.com/Sandia-OpenSHMEM/SOS</a> is a reasonable approximation to the HW-based send-recv that you probably used on Blue Gene/Q via SPI.  You might be able to build off of that code base.</div><div><br></div><div>Best,</div><div><br></div><div>Jeff</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="gmail-HOEnZb"><font color="#888888">

Peter<br>

<br>

<br>


______________________________<wbr>_________________<br>

Libfabric-users mailing list<br>

<a href="mailto:Libfabric-users@lists.openfabrics.org">Libfabric-users@lists.<wbr>openfabrics.org</a><br>

<a href="http://lists.openfabrics.org/mailman/listinfo/libfabric-users" rel="noreferrer" target="_blank">http://lists.openfabrics.org/<wbr>mailman/listinfo/libfabric-<wbr>users</a><br>

</font></span></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature">Jeff Hammond<br><a href="mailto:jeff.science@gmail.com" target="_blank">jeff.science@gmail.com</a><br><a href="http://jeffhammond.github.io/" target="_blank">http://jeffhammond.github.io/</a></div>

</div></div>