<div dir="ltr"><div>In my non-expert opinion, OFI is already providing the right abstraction for multi-rail situations in the form of domains:</div><div><br></div>"Domains usually map to a specific local network interface adapter. A domain may either refer to the entire NIC, a port on a multi-port NIC, or a virtual device exposed by a NIC. From the viewpoint of the application, a domain identifies a set of resources that may be used together." (<a href="https://github.com/ofiwg/ofi-guide/blob/master/OFIGuide.md">https://github.com/ofiwg/ofi-guide/blob/master/OFIGuide.md</a>)<div><br></div><div>From this, MPI libraries and the like would then need to support multiple domains.</div><div><br></div><div>Jeff<br><br>On Fri, Jun 2, 2017 at 12:21 PM, Hefty, Sean &lt;<a href="mailto:sean.hefty@intel.com">sean.hefty@intel.com</a>&gt; wrote:<br>><br>> Copying libfabric-users mailing list on this message.<br>><br>> Daniel, would you be able to join an ofiwg call to discuss these in more detail?  The calls are every other Tuesday from 9-10 PST, with the next call on Tuesday the 6th.<br>><br>> - Sean<br>><br>> > We work with HPC systems that deploy multiple identical network<br>> > adapters (including Intel OmniPath and MLX InfiniBand adapters) on<br>> > compute nodes.<br>> ><br>> > Over time, we encountered two issues which we believe can be addressed<br>> > by the OFI library.<br>> ><br>> > First, a number of MPI implementations assume a homogeneous SW/HW setup<br>> > on all compute nodes.  For example, assume nodes with 2 adapters and 2<br>> > separate networks. Some MPI implementations assume that network<br>> > adapter A resides on CPU socket 0 on all nodes and connects to network<br>> > 0; and network adapter B resides on CPU socket 1 and connects to<br>> > network 1.  Unfortunately that is not always the case.  There are<br>> > systems where some nodes use adapter A to connect to network 0 and<br>> > others use adapter B to connect to network 0.  
Same for network 1,<br>> > where we have mixed (crossed) adapters connected to the same network.  In<br>> > such cases, MPI and lower layers cannot establish a peer-to-peer<br>> > connection.  The best way to solve this is to use the network subnet<br>> > ID to establish connections between pairs.  When there are multiple<br>> > networks and subnetwork IDs, mpirun would specify a network ID<br>> > (Platform MPI does this) and then the software can figure out from the<br>> > subnet ID what adapter each node is using to connect to that network.<br>> > Instead of implementing this logic in each MPI, it would be great if<br>> > OFI implemented this logic, since it is a one-stop shop for all network<br>> > devices and providers.<br>> ><br>> > Second, multirail support is hit and miss across MPI<br>> > implementations.  The Intel OmniPath PSM2 library actually did a great job<br>> > here by implementing multirail support at the PSM2 level. This means<br>> > all layers above it, such as MPI, get this functionality for free.<br>> > Again, given that many MPI implementations can be built on top of OFI,<br>> > it would also be great if OFI had multirail support.<br>> ><br>> > Thank you<br>> > Daniel Faraj<br>> _______________________________________________<br>> Libfabric-users mailing list<br>> <a href="mailto:Libfabric-users@lists.openfabrics.org">Libfabric-users@lists.openfabrics.org</a><br>> <a href="http://lists.openfabrics.org/mailman/listinfo/libfabric-users">http://lists.openfabrics.org/mailman/listinfo/libfabric-users</a><br><br><br><br><br>--<br>Jeff Hammond<br><a href="mailto:jeff.science@gmail.com">jeff.science@gmail.com</a><br><a href="http://jeffhammond.github.io/">http://jeffhammond.github.io/</a><br><div class="gmail_extra">

</div></div></div>