<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<meta name="Generator" content="Microsoft Exchange Server">
<!-- converted from text --><style><!-- .EmailQuote { margin-left: 1pt; padding-left: 4pt; border-left: #800000 2px solid; } --></style>
</head>
<body>
<style type="text/css" style="display:none;"><!-- P {margin-top:0;margin-bottom:0;} --></style>
<div id="divtagdefaultwrapper" style="font-size:12pt;color:#000000;font-family:Calibri,Helvetica,sans-serif;" dir="ltr">
<p>Working!</p>
<p><br>
</p>
<p>Solution: do not use FI_DIRECTED_RECV, because the endpoint doing the sending is not the (recv) endpoint registered in the address vector. Instead, supplement the tags with some extra src/dst info that makes them unique. I found the mistake in the tag
generation code.</p>
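<p>For anyone hitting the same thing, here is a minimal sketch of what "supplementing the tags with src/dst info" can look like. The 16/16/32 bit split and the helper names (make_tag, tag_src, tag_dst, tag_user) are assumptions for illustration, not the layout used in my code; the usable tag bits also depend on what the provider reports in mem_tag_format.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tag layout (an assumption, not the thread author's):
 *   [ src rank : 16 bits | dst rank : 16 bits | user tag : 32 bits ] */
#define RANK_BITS 16u
#define USER_BITS 32u
#define RANK_MASK ((UINT64_C(1) << RANK_BITS) - 1)
#define USER_MASK ((UINT64_C(1) << USER_BITS) - 1)

/* Pack source rank, destination rank and the application tag into 64 bits. */
static uint64_t make_tag(uint32_t src_rank, uint32_t dst_rank, uint32_t user_tag)
{
    /* Ranks that do not fit would silently collide, so reject them. */
    assert(src_rank <= RANK_MASK && dst_rank <= RANK_MASK);
    return ((uint64_t)src_rank << (RANK_BITS + USER_BITS)) |
           ((uint64_t)dst_rank << USER_BITS) |
           (uint64_t)user_tag;
}

/* Unpack helpers: shift first, then mask, so no field leaks into another. */
static uint32_t tag_src(uint64_t tag)
{
    return (uint32_t)((tag >> (RANK_BITS + USER_BITS)) & RANK_MASK);
}

static uint32_t tag_dst(uint64_t tag)
{
    return (uint32_t)((tag >> USER_BITS) & RANK_MASK);
}

static uint32_t tag_user(uint64_t tag)
{
    return (uint32_t)(tag & USER_MASK);
}
```

<p>With this, two ranks that reuse the same user tag still produce different 64-bit tags, so a receive posted for one sender cannot be satisfied by another sender even without FI_DIRECTED_RECV.</p>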
<p><br>
</p>
<p>An alternative solution might be to register 2 addresses for each rank in the address vector (say, with the extra ones at N to 2N-1), keep FI_DIRECTED_RECV and, when receiving, use 2*rank (or 2*rank-1, or whatever, depending on how the endpoints
are added to the AV) as the receive address.</p>
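<p>A rough sketch of that index arithmetic, assuming a table-style AV where the address handed to the receive is simply the insertion index, and assuming ranks are numbered from 0. Both layouts and both function names are hypothetical; I have not tested this approach.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Assumed "blocked" layout: receive endpoints of ranks 0..N-1 are inserted
 * first (entries 0..N-1), then the send endpoints (entries N..2N-1). */
static uint64_t send_ep_index_blocked(uint64_t rank, uint64_t nranks)
{
    assert(rank < nranks);
    return nranks + rank;
}

/* Assumed "interleaved" layout: each rank inserts its receive endpoint and
 * then its send endpoint, so rank r owns entries 2r and 2r+1. */
static uint64_t send_ep_index_interleaved(uint64_t rank, uint64_t nranks)
{
    assert(rank < nranks);
    return 2 * rank + 1;
}
```

<p>The value returned for a rank would be what a directed receive names as its source when expecting a message from that rank's send endpoint.</p>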
<p><br>
</p>
<p>Anyway. I apologise for the noise on this list. I should have realized sooner that receiving from rank N doesn't mean anything if rank N has more than 1 endpoint and the address vector maps the endpoints to ranks.</p>
<p><br>
</p>
<p>JB</p>
<p><br>
</p>
</div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Libfabric-users &lt;libfabric-users-bounces@lists.openfabrics.org&gt; on behalf of Biddiscombe, John A. &lt;john.biddiscombe@cscs.ch&gt;<br>
<b>Sent:</b> 15 February 2021 11:15:53<br>
<b>To:</b> Hefty, Sean; libfabric-users@lists.openfabrics.org<br>
<b>Subject:</b> Re: [libfabric-users] Not receiving messages from other ranks</font>
<div> </div>
</div>
<div>
<meta content="text/html; charset=UTF-8">
<style type="text/css" style="">
<!--
p
{margin-top:0;
margin-bottom:0}
-->
</style>
<div dir="ltr">
<div id="x_divtagdefaultwrapper" dir="ltr" style="font-size:12pt; color:#000000; font-family:Calibri,Helvetica,sans-serif">
<p>Sean</p>
<p><br>
</p>
<p>Thanks for taking the time to look into it.</p>
<p><br>
</p>
<p>I might have an idea what is going wrong.</p>
<p><br>
</p>
<p>When I use a send endpoint that is different from the receive endpoint, the debug log shows:<br>
</p>
<p><br>
</p>
<p></p>
<div>libfabric:217045:ofi_rxm:cq:rxm_cq_log_comp():924&lt;debug&gt; Reporting FI_SEND, FI_TAGGED completion<br>
libfabric:217045:ofi_rxm:cq:rxm_handle_recv_comp():801&lt;debug&gt; Got TAGGED op<br>
libfabric:217045:ofi_rxm:cq:rxm_match_rx_buf():762&lt;debug&gt; No matching recv found for incoming msg (fi_addr: 0xffffffffffffffff tag: 0xfbcc407000000000)<br>
libfabric:217045:ofi_rxm:cq:rxm_match_rx_buf():764&lt;debug&gt; Enqueueing msg to unexpected msg queue</div>
<div><br>
</div>
<p></p>
<p>You can see in this debug snippet that a send does not match the pre-posted recv. The tag is valid and a recv was posted with it, but because the send endpoint does not receive, I have not given it an address in the AV. The recv therefore sees
<span>fi_addr: 0xffffffffffffffff</span> and does not match it. If the message came from the recv endpoint on that rank, the address would be correct and it would match (I surmise).</p>
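<p>To make the suspected mismatch concrete, here is a simplified model of directed tagged matching. It is a sketch of my understanding, not the rxm code: the struct and function names are invented, and ADDR_UNSPEC stands in for FI_ADDR_UNSPEC (all bits set, as in the log above).</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ADDR_UNSPEC UINT64_MAX /* stand-in for FI_ADDR_UNSPEC */

/* Hypothetical record of a posted tagged receive. */
struct posted_recv {
    uint64_t src_addr; /* expected source, or ADDR_UNSPEC for "any" */
    uint64_t tag;
    uint64_t ignore;   /* bits set here are not compared */
};

/* A receive matches a source if it is a wildcard or names it exactly. */
static bool addr_matches(uint64_t recv_addr, uint64_t msg_addr)
{
    return recv_addr == ADDR_UNSPEC || recv_addr == msg_addr;
}

/* Tags match when they agree on every bit outside the ignore mask. */
static bool tag_matches(uint64_t recv_tag, uint64_t ignore, uint64_t msg_tag)
{
    return (recv_tag | ignore) == (msg_tag | ignore);
}

/* The source address is only checked when directed receive is in use. */
static bool recv_matches(const struct posted_recv *r, uint64_t msg_addr,
                         uint64_t msg_tag, bool directed)
{
    assert(r != NULL);
    if (directed && !addr_matches(r->src_addr, msg_addr))
        return false;
    return tag_matches(r->tag, r->ignore, msg_tag);
}
```

<p>In this model a receive posted for AV entry 3 with the right tag still fails to match a message whose source resolves to ADDR_UNSPEC when directed is true, and matches once the address check is dropped, which is the behaviour I am seeing.</p>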
<p><br>
</p>
<p>If I remove the FI_DIRECTED_RECV flag, then the mismatch goes away, but I'm left with messages being received into the wrong buffer, because different ranks reuse the same tags.</p>
<p><br>
</p>
<p>When I extend my tags to contain the rank info, it stops working, but it must be due to a bad tag bitmasking operation on my part which I'm now looking at.</p>
<p><br>
</p>
<p>If there is any way of keeping FI_DIRECTED_RECV and making the address work, then that would be great. But I suspect I'm out of luck ...</p>
<p><br>
</p>
<p>I will report back if I can fix the tag issue and the problem is solved.<br>
</p>
<p><br>
</p>
<p>JB<br>
</p>
<p><br>
</p>
</div>
<hr tabindex="-1" style="display:inline-block; width:98%">
<div id="x_divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" color="#000000" style="font-size:11pt"><b>From:</b> Hefty, Sean &lt;sean.hefty@intel.com&gt;<br>
<b>Sent:</b> 13 February 2021 02:50:09<br>
<b>To:</b> Hefty, Sean; Biddiscombe, John A.; libfabric-users@lists.openfabrics.org<br>
<b>Subject:</b> RE: Not receiving messages from other ranks</font>
<div> </div>
</div>
</div>
<font size="2"><span style="font-size:10pt;">
<div class="PlainText">> I'm looking into this. The checks in the tcp provider are too strict. I have a patch<br>
> that fixes that, but I doubt it will help here. I'm still analyzing the rxm code to<br>
> understand if it's passing the right capabilities to tcp and handling its checks<br>
> correctly.<br>
<br>
As an update, I haven't found anything in rxm that looks incorrect. The changes I made to the other providers have been merged into master. It's possible the fix to the tcp provider will eliminate the error you're seeing during enable.<br>
<br>
- Sean<br>
</div>
</span></font></div>
</body>
</html>