[libfabric-users] Not receiving messages from other ranks
Biddiscombe, John A.
john.biddiscombe at cscs.ch
Mon Feb 15 02:15:53 PST 2021
Thanks for taking the time to look into it.
I might have an idea what is going wrong.
When I use a send endpoint that is different from the receive endpoint, I get this:
libfabric:217045:ofi_rxm:cq:rxm_cq_log_comp():924<debug> Reporting FI_SEND, FI_TAGGED completion
libfabric:217045:ofi_rxm:cq:rxm_handle_recv_comp():801<debug> Got TAGGED op
libfabric:217045:ofi_rxm:cq:rxm_match_rx_buf():762<debug> No matching recv found for incoming msg (fi_addr: 0xffffffffffffffff tag: 0xfbcc407000000000)
libfabric:217045:ofi_rxm:cq:rxm_match_rx_buf():764<debug> Enqueueing msg to unexpected msg queue
You can see in this debug snippet that a send does not match the pre-posted recv. The tag is valid and a recv was posted with it, but because the send endpoint does not receive, I have not given it an address in the AV. The recv therefore sees fi_addr: 0xffffffffffffffff (FI_ADDR_UNSPEC) and does not match it. If the message had come from the recv endpoint on that rank, the address would be correct and it would match (I surmise).
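To make the suspected mismatch concrete, here is a simplified model of source matching for a directed receive. It is only a sketch of my understanding, not the rxm code: src_matches is a hypothetical helper, and ADDR_UNSPEC is redefined locally with the same value as libfabric's FI_ADDR_UNSPEC so the snippet stands alone.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same value as libfabric's FI_ADDR_UNSPEC ((uint64_t) -1), redefined here
 * so the sketch compiles without the libfabric headers. */
#define ADDR_UNSPEC UINT64_MAX

/* Hypothetical model (not the provider's implementation): a receive posted
 * with a specific source address only accepts messages whose resolved
 * source address is identical. A receive posted with ADDR_UNSPEC accepts
 * a message from any source. */
static bool src_matches(uint64_t recv_src, uint64_t msg_src)
{
    if (recv_src == ADDR_UNSPEC)
        return true;            /* wildcard receive */
    return recv_src == msg_src; /* directed receive */
}
```

Under this model, a recv posted for peer address 3 rejects a message whose source resolves to ADDR_UNSPEC, which is the case when the sending endpoint was never inserted into the receiver's AV.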
If I remove the FI_DIRECTED_RECV flag, the mismatch goes away, but messages are then received into the wrong buffer, because different ranks reuse the same tags.
When I extend my tags to contain the rank info, it stops working, but it must be due to a bad tag bitmasking operation on my part which I'm now looking at.
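For reference, a minimal sketch of one way to pack the rank into the tag. The layout is an assumption chosen for illustration (rank in the top 16 bits, user tag in the lower 48), not the layout my code actually uses, and all names here are hypothetical. tag_matches models the usual tagged-receive rule that bits set in the ignore mask are excluded from the comparison.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: [ 16-bit rank | 48-bit user tag ]. */
#define RANK_BITS  16
#define RANK_SHIFT (64 - RANK_BITS)
#define RANK_MASK  (((UINT64_C(1) << RANK_BITS) - 1) << RANK_SHIFT)
#define USER_MASK  (~RANK_MASK)

/* Combine a rank and a user tag. Bits of user_tag that fall outside the
 * lower 48 bits are silently dropped, so user tags must fit in that range. */
static uint64_t make_tag(uint32_t rank, uint64_t user_tag)
{
    return (((uint64_t)rank << RANK_SHIFT) & RANK_MASK) | (user_tag & USER_MASK);
}

/* Recover the rank field from a packed tag. */
static uint32_t tag_rank(uint64_t tag)
{
    return (uint32_t)((tag & RANK_MASK) >> RANK_SHIFT);
}

/* Recover the user tag field from a packed tag. */
static uint64_t tag_user(uint64_t tag)
{
    return tag & USER_MASK;
}

/* Model of tag matching: bits set in 'ignore' do not take part in the
 * comparison; all remaining bits must be equal. */
static bool tag_matches(uint64_t recv_tag, uint64_t ignore, uint64_t msg_tag)
{
    return ((recv_tag ^ msg_tag) & ~ignore) == 0;
}
```

With an ignore mask of 0, the same user tag from two different ranks no longer collides. Passing RANK_MASK as the ignore mask gives a receive that accepts that user tag from any rank. Note that a provider may restrict which tag bits are usable (see mem_tag_format in the endpoint attributes), so the split would need checking against that.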
If there is any way of keeping FI_DIRECTED_RECV and making the address work, that would be great. But I suspect I'm out of luck ...
I will report back if I can fix the tag issue and the problem is solved.
From: Hefty, Sean <sean.hefty at intel.com>
Sent: 13 February 2021 02:50:09
To: Hefty, Sean; Biddiscombe, John A.; libfabric-users at lists.openfabrics.org
Subject: RE: Not receiving messages from other ranks
> I'm looking into this. The checks in the tcp provider are too strict. I have a patch
> that fixes that, but I doubt it will help here. I'm still analyzing the rxm code to
> understand if it's passing the right capabilities to tcp and handling its checks
As an update, I haven't found anything in rxm that looks incorrect. The changes I made to the other providers have been merged into master. It's possible the fix to the tcp provider will eliminate the error you're seeing during enable.