<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:Wingdings;
panose-1:5 0 0 0 0 0 0 0 0 0;}
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Aptos;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
font-size:11.0pt;
font-family:"Aptos",sans-serif;
mso-ligatures:standardcontextual;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:#467886;
text-decoration:underline;}
span.EmailStyle17
{mso-style-type:personal-compose;
font-family:"Aptos",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:11.0pt;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
/* List Definitions */
@list l0
{mso-list-id:77555621;
mso-list-type:hybrid;
mso-list-template-ids:-744864724 820695664 67698691 67698693 67698689 67698691 67698693 67698689 67698691 67698693;}
@list l0:level1
{mso-level-start-at:0;
mso-level-number-format:bullet;
mso-level-text:-;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:"Aptos",sans-serif;
mso-fareast-font-family:Aptos;
mso-bidi-font-family:"Times New Roman";}
@list l0:level2
{mso-level-number-format:bullet;
mso-level-text:o;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:"Courier New";}
@list l0:level3
{mso-level-number-format:bullet;
mso-level-text:\F0A7;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:Wingdings;}
@list l0:level4
{mso-level-number-format:bullet;
mso-level-text:\F0B7;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:Symbol;}
@list l0:level5
{mso-level-number-format:bullet;
mso-level-text:o;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:"Courier New";}
@list l0:level6
{mso-level-number-format:bullet;
mso-level-text:\F0A7;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:Wingdings;}
@list l0:level7
{mso-level-number-format:bullet;
mso-level-text:\F0B7;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:Symbol;}
@list l0:level8
{mso-level-number-format:bullet;
mso-level-text:o;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:"Courier New";}
@list l0:level9
{mso-level-number-format:bullet;
mso-level-text:\F0A7;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;
font-family:Wingdings;}
ol
{margin-bottom:0in;}
ul
{margin-bottom:0in;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-US" link="#467886" vlink="#96607D" style="word-wrap:break-word">
<div class="WordSection1">
<p class="MsoNormal">09/03/2024<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><b><u>Participants:<o:p></o:p></u></b></p>
<p class="MsoNormal">Alexia Ingerson (Intel)<o:p></o:p></p>
<p class="MsoNormal">Jianxin Xiong (Intel)<o:p></o:p></p>
<p class="MsoNormal">Ben Lynam (Cornelis)<o:p></o:p></p>
<p class="MsoNormal">Bob Cernohous (Cornelis)<o:p></o:p></p>
<p class="MsoNormal">Charles Sherada (Cornelis)<o:p></o:p></p>
<p class="MsoNormal">Chuck Fossen<o:p></o:p></p>
<p class="MsoNormal">Ian Ziemba (HPE)<o:p></o:p></p>
<p class="MsoNormal">Jack Morrison (Cornelis)<o:p></o:p></p>
<p class="MsoNormal">Jerome Soumagne<o:p></o:p></p>
<p class="MsoNormal">Jessie Yang (AWS)<o:p></o:p></p>
<p class="MsoNormal">John Byrne (HPE)<o:p></o:p></p>
<p class="MsoNormal">Ken Raffenetti (ANL)<o:p></o:p></p>
<p class="MsoNormal">Peinan Zhang (Intel)<o:p></o:p></p>
<p class="MsoNormal">Rajalaxmi (Intel)<o:p></o:p></p>
<p class="MsoNormal">Shi Jin (AWS)<o:p></o:p></p>
<p class="MsoNormal">Stephen Oost (Intel)<o:p></o:p></p>
<p class="MsoNormal">Steven Welch (HPE)<o:p></o:p></p>
<p class="MsoNormal">Zach Dworkin (Intel)<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><b><u>Summary:<o:p></o:p></u></b></p>
<p class="MsoNormal">Libfabric 2.0.0-alpha was released on 8/30 and includes all deprecations and a new provider (LPP). The beta release is targeted for 2 months from now (end of October). Please check the new alpha release for any issues, especially compatibility issues when building. No new 1.x branches are expected, though 1.x.y releases may occur as needed.<o:p></o:p></p>
<p class="MsoNormal">AWS presented proposals to expand FI_HMEM interface capabilities. AWS, Intel, and HPE agree that adding fi_info->hmem_attr would be the best solution. AWS will draft a more detailed proposal, which will include attributes such as which interfaces support p2p (NIC to GPU) or optimized device memory copies.<o:p></o:p></p>
<p class="MsoNormal">Cornelis proposed adding an opx-specific yaml file to trigger their internal testing. There were no objections.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><b><u>Notes:<o:p></o:p></u></b></p>
<p class="MsoNormal">Libfabric 2.0.0-alpha released 8/30<o:p></o:p></p>
<ul style="margin-top:0in" type="disc">
<li class="MsoNormal" style="mso-list:l0 level1 lfo1">Deprecations<o:p></o:p></li><li class="MsoNormal" style="mso-list:l0 level1 lfo1">New LPP provider<o:p></o:p></li></ul>
<p class="MsoNormal">Please check the new alpha release for any issues, both when building (warnings were added for deprecated features, so watch for compatibility issues) and when running.<o:p></o:p></p>
<p class="MsoNormal">Beta release planned for 2 months from now (end of October)<o:p></o:p></p>
<ul style="margin-top:0in" type="disc">
<li class="MsoNormal" style="mso-list:l0 level1 lfo1">All new features are expected to make it into the beta release<o:p></o:p></li></ul>
<p class="MsoNormal">Q: What’s the plan if we need a new 1.x release? Can we create a new 1.x branch?<o:p></o:p></p>
<p class="MsoNormal">A: We can still do 1.x.y releases for bug fixes but shouldn’t need a new minor release.<o:p></o:p></p>
<p class="MsoNormal">The NCCL plugin currently cannot build against upstream; the build issue may need a fix.<o:p></o:p></p>
<p class="MsoNormal">Q: Are deprecated features going to be removed for official 2.0 release?<o:p></o:p></p>
<p class="MsoNormal">A: They will stick around for about a year before removal. It will take time for middleware to officially remove its use of deprecated features.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">HMEM capability refinement (AWS presentation – Jessie, Shi)<o:p></o:p></p>
<p class="MsoNormal">Currently FI_HMEM is a single on/off capability that represents many interfaces<o:p></o:p></p>
<p class="MsoNormal">Proposal to introduce more capabilities to manage more specific FI_HMEM abilities<o:p></o:p></p>
<p class="MsoNormal">Add struct fi_hmem_attr *hmem_attr to fi_info to return a specific set of capabilities<o:p></o:p></p>
<p class="MsoNormal">struct fi_hmem_attr {<o:p></o:p></p>
<p class="MsoNormal"> enum fi_hmem_iface iface;<o:p></o:p></p>
<p class="MsoNormal"> bool use_p2p;<o:p></o:p></p>
<p class="MsoNormal"> bool use_dev_reg_copy;<o:p></o:p></p>
<p class="MsoNormal"> bool api_permitted;<o:p></o:p></p>
<p class="MsoNormal"> struct fi_hmem_attr *next;<o:p></o:p></p>
<p class="MsoNormal">};<o:p></o:p></p>
<p class="MsoNormal">iface: memory interface type (CUDA, ROCR, ZE)<o:p></o:p></p>
<p class="MsoNormal">use_p2p (accelerator-to-NIC p2p, not accelerator-to-accelerator p2p): whether peer-to-peer transfers should be used; can filter out shm and allows apps to specify p2p early<o:p></o:p></p>
<p class="MsoNormal">use_dev_reg_copy: whether to use an optimized memcpy for device memory (i.e. GDR)<o:p></o:p></p>
<p class="MsoNormal">api_permitted: whether device-specific API calls are allowed; prevents unsafe operations and resource management conflicts<o:p></o:p></p>
<p class="MsoNormal">next: pointer to next hmem_attr if using multiple non-system ifaces<o:p></o:p></p>
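<p class="MsoNormal">A minimal sketch of how an application might walk the proposed list, one entry per non-system iface. The proposal is not part of libfabric, so every type and function name below (sketch_hmem_iface, sketch_hmem_attr, sketch_find_iface, sketch_count_p2p) is a hypothetical stand-in declared locally, not an existing API.<o:p></o:p></p>
<pre>
```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the proposed types. They are declared here
 * because the proposal is not part of libfabric; names are assumptions. */
enum sketch_hmem_iface {
    SKETCH_HMEM_SYSTEM,
    SKETCH_HMEM_CUDA,
    SKETCH_HMEM_ROCR,
    SKETCH_HMEM_ZE
};

struct sketch_hmem_attr {
    enum sketch_hmem_iface iface;   /* memory interface type */
    bool use_p2p;                   /* accelerator-to-NIC p2p */
    bool use_dev_reg_copy;          /* optimized device memcpy */
    bool api_permitted;             /* device API calls allowed */
    struct sketch_hmem_attr *next;  /* next iface entry, or NULL */
};

/* Return the first entry describing `iface`, or NULL if none is listed. */
static const struct sketch_hmem_attr *
sketch_find_iface(const struct sketch_hmem_attr *head,
                  enum sketch_hmem_iface iface)
{
    for (const struct sketch_hmem_attr *cur = head; cur; cur = cur->next) {
        if (cur->iface == iface)
            return cur;
    }
    return NULL;
}

/* Count how many listed interfaces have use_p2p set. */
static size_t sketch_count_p2p(const struct sketch_hmem_attr *head)
{
    size_t n = 0;

    for (const struct sketch_hmem_attr *cur = head; cur; cur = cur->next) {
        if (cur->use_p2p)
            n++;
    }
    return n;
}
```
</pre>
<p class="MsoNormal">With a two-entry list (for example CUDA followed by ZE), sketch_find_iface returns the ZE entry when asked for ZE and NULL when asked for ROCR.<o:p></o:p></p>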
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">The struct will be both input and output: an app can set fields to request settings, and the fields are used to filter and select certain ifaces, etc.<o:p></o:p></p>
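<p class="MsoNormal">One possible reading of the input/output behavior, sketched under stated assumptions: a hint field set to true is treated as a requirement, a false hint field as "no requirement", and api_permitted is left out of matching because its semantics were still under discussion. The types and functions (sketch_hmem_attr, sketch_hint_matches, sketch_select) are hypothetical and declared locally; none of this is an existing libfabric API or a settled matching rule.<o:p></o:p></p>
<pre>
```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical local copies of the proposed types; names are assumptions. */
enum sketch_hmem_iface {
    SKETCH_HMEM_SYSTEM,
    SKETCH_HMEM_CUDA,
    SKETCH_HMEM_ROCR,
    SKETCH_HMEM_ZE
};

struct sketch_hmem_attr {
    enum sketch_hmem_iface iface;
    bool use_p2p;
    bool use_dev_reg_copy;
    bool api_permitted;
    struct sketch_hmem_attr *next;
};

/* Assumed rule: a hint matches a provider entry when the interfaces agree
 * and every capability the hint sets to true is also set by the provider.
 * api_permitted is deliberately ignored here. */
static bool sketch_hint_matches(const struct sketch_hmem_attr *hint,
                                const struct sketch_hmem_attr *prov)
{
    if (hint->iface != prov->iface)
        return false;
    if (hint->use_p2p && !prov->use_p2p)
        return false;
    if (hint->use_dev_reg_copy && !prov->use_dev_reg_copy)
        return false;
    return true;
}

/* Return the first provider entry that satisfies any hint in the list.
 * A NULL hint list means "no filtering" and yields the provider list head. */
static const struct sketch_hmem_attr *
sketch_select(const struct sketch_hmem_attr *hints,
              const struct sketch_hmem_attr *prov)
{
    if (!hints)
        return prov;

    for (const struct sketch_hmem_attr *p = prov; p; p = p->next) {
        for (const struct sketch_hmem_attr *h = hints; h; h = h->next) {
            if (sketch_hint_matches(h, p))
                return p;
        }
    }
    return NULL;
}
```
</pre>
<p class="MsoNormal">Under these assumptions, a hint asking for ZE with use_p2p set finds no match against a provider whose ZE entry has use_p2p cleared, while the same hint without use_p2p selects that entry.<o:p></o:p></p>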
<p class="MsoNormal">Q: Is it hard for a provider to onboard this new API because it’s not using common code?<o:p></o:p></p>
<p class="MsoNormal">A: Could do this either way. Most of the information is in common code. If a provider supports certain attributes, it should go through common code. use_p2p could be the only field not in common code.<o:p></o:p></p>
<p class="MsoNormal">Q: use_p2p should be clearer. Does it mean the provider should use p2p or is allowed to use p2p?<o:p></o:p></p>
<p class="MsoNormal">A: It should follow the current common fi_use_p2p settings (preferred or disabled), where each one already has its defined behavior.<o:p></o:p></p>
<p class="MsoNormal">api_permitted is definitely an input-only value, which is an unusual use of an fi_info attr. Its meaning is very unclear: the provider has to use the API to use the accelerator library, and even FI_HMEM initialization uses API code. The main use case for api_permitted is MR registration and data transfer calls. Need to revisit this option.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Alternative: add fields to fi_mr_attr on memory registration path<o:p></o:p></p>
<p class="MsoNormal">struct fi_mr_attr {<o:p></o:p></p>
<p class="MsoNormal"> //existing fields<o:p></o:p></p>
<p class="MsoNormal"> bool use_p2p;<o:p></o:p></p>
<p class="MsoNormal"> bool use_dev_reg_copy;<o:p></o:p></p>
<p class="MsoNormal"> bool api_permitted;<o:p></o:p></p>
<p class="MsoNormal">};<o:p></o:p></p>
<p class="MsoNormal">Pros: no need for users to set values<o:p></o:p></p>
<p class="MsoNormal">Cons: scope is limited to MR registration, users cannot see these fields by running fi_info<o:p></o:p></p>
<p class="MsoNormal">Preference is the fi_info path, which lets users see and set provider-specific settings and covers a wider use case<o:p></o:p></p>
<p class="MsoNormal">HPE, AWS, and Intel all prefer having it at the fi_info level<o:p></o:p></p>
<p class="MsoNormal">Q: Any comments on input/output p2p settings?<o:p></o:p></p>
<p class="MsoNormal">A: The api_permitted setting is very important to them. Need to think more on input/output settings.<o:p></o:p></p>
<p class="MsoNormal">AWS will continue to refine the proposal for adding this to fi_info, and we will revisit it<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Cornelis GitHub Actions workflow (Jack Morrison):<o:p></o:p></p>
<p class="MsoNormal">Went over the cn.yaml addition in new PR <a href="https://github.com/ofiwg/libfabric/pull/10354">
github.com/ofiwg/libfabric/pull/10354</a><o:p></o:p></p>
<p class="MsoNormal">Trying to leverage more of the utilities available as part of GitHub Actions for opx testing.<o:p></o:p></p>
<ul style="margin-top:0in" type="disc">
<li class="MsoNormal" style="mso-list:l0 level1 lfo1">cn.yaml checks whether the PR targets the internal Cornelis libfabric repository. It is a no-op if the PR does not target the internal repo<o:p></o:p></li></ul>
<p class="MsoNormal">Runs on internal Cornelis machines<o:p></o:p></p>
<p class="MsoNormal">Will not get triggered for non-Cornelis/upstream PRs<o:p></o:p></p>
<p class="MsoNormal">Q: Why not have this only internal? What’s the benefit of having it upstream?<o:p></o:p></p>
<p class="MsoNormal">A: It makes commits easier to handle given the rebasing/upstreaming flow<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">No objections. Fine to move forward<o:p></o:p></p>
</div>
</body>
</html>