[libfabric-users] Not receiving messages from other ranks

Biddiscombe, John A. john.biddiscombe at cscs.ch
Mon Feb 15 03:54:11 PST 2021


Solution: do not use FI_DIRECTED_RECV, because the endpoint doing the sending is not the (recv) endpoint registered in the address vector. Instead, supplement the tags with some extra src/dst info that makes them unique. I found the mistake in the tag generation code.
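
As an illustration of folding src/dst info into a tag, here is a minimal sketch. The bit layout, the macro names and the helpers make_tag(), tag_src() and tag_dst() are hypothetical and not taken from the code discussed in this thread. It also assumes all 64 tag bits are usable, which depends on the provider.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout (an assumption, not the layout used in the thread):
 *   bits 63..48  source rank
 *   bits 47..32  destination rank
 *   bits 31..0   application tag
 */
#define TAG_RANK_BITS 16
#define TAG_APP_BITS  32
#define TAG_RANK_MASK ((UINT64_C(1) << TAG_RANK_BITS) - 1)
#define TAG_APP_MASK  ((UINT64_C(1) << TAG_APP_BITS) - 1)

/* Build a 64-bit tag. The casts widen to 64 bits BEFORE shifting so that
 * no rank bits are lost. */
static uint64_t make_tag(uint32_t src_rank, uint32_t dst_rank, uint32_t app_tag)
{
    assert(src_rank <= TAG_RANK_MASK);
    assert(dst_rank <= TAG_RANK_MASK);
    return ((uint64_t)src_rank << (TAG_RANK_BITS + TAG_APP_BITS)) |
           ((uint64_t)dst_rank << TAG_APP_BITS) |
           ((uint64_t)app_tag & TAG_APP_MASK);
}

/* Recover the source rank from a tag built by make_tag(). */
static uint32_t tag_src(uint64_t tag)
{
    return (uint32_t)((tag >> (TAG_RANK_BITS + TAG_APP_BITS)) & TAG_RANK_MASK);
}

/* Recover the destination rank from a tag built by make_tag(). */
static uint32_t tag_dst(uint64_t tag)
{
    return (uint32_t)((tag >> TAG_APP_BITS) & TAG_RANK_MASK);
}
```

With a layout like this, the sender and the receiver both compute make_tag(sender_rank, receiver_rank, app_tag), and the receive is posted with an ignore mask of 0, so two ranks that reuse the same application tag no longer collide.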

An alternative solution might be to register two addresses for each rank in the address vector (say, with the second set occupying entries N to 2N-1), keep FI_DIRECTED_RECV, and when receiving use 2*rank as the address to receive from (or 2*rank-1, or whatever, depending on how the endpoints are added to the AV).
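
A rough sketch of the index arithmetic this alternative implies. It assumes the AV hands out addresses as consecutive indices in insertion order (as with an FI_AV_TABLE style AV); the enum and both helper functions are hypothetical names, and the two layouts are only examples.

```c
#include <assert.h>
#include <stdint.h>

/* Which of a rank's two endpoints an AV entry refers to. */
enum ep_kind { EP_RECV = 0, EP_SEND = 1 };

/* Blocked layout: recv endpoints occupy 0..N-1, send endpoints N..2N-1. */
static uint64_t av_index_blocked(uint32_t rank, enum ep_kind kind,
                                 uint32_t nranks)
{
    assert(rank < nranks);
    return (uint64_t)rank + (kind == EP_SEND ? nranks : 0);
}

/* Interleaved layout: entry 2*rank is the recv endpoint and
 * entry 2*rank+1 is the send endpoint of the same rank. */
static uint64_t av_index_interleaved(uint32_t rank, enum ep_kind kind)
{
    return 2 * (uint64_t)rank + (kind == EP_SEND ? 1 : 0);
}
```

Under either layout, a directed receive for messages from a given rank would name that rank's EP_SEND entry as the source, while sends to that rank would target its EP_RECV entry.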

Anyway. I apologise for the noise on this list. I should have realized sooner that receiving from rank N doesn't mean anything if rank N has more than 1 endpoint and the address vector maps the endpoints to ranks.


From: Libfabric-users <libfabric-users-bounces at lists.openfabrics.org> on behalf of Biddiscombe, John A. <john.biddiscombe at cscs.ch>
Sent: 15 February 2021 11:15:53
To: Hefty, Sean; libfabric-users at lists.openfabrics.org
Subject: Re: [libfabric-users] Not receiving messages from other ranks


Thanks for taking the time to look into it.

I might have an idea what is going wrong.

When I use a send endpoint that is different from the receive endpoint, I get:

libfabric:217045:ofi_rxm:cq:rxm_cq_log_comp():924<debug> Reporting FI_SEND, FI_TAGGED completion
libfabric:217045:ofi_rxm:cq:rxm_handle_recv_comp():801<debug> Got TAGGED op
libfabric:217045:ofi_rxm:cq:rxm_match_rx_buf():762<debug> No matching recv found for incoming msg (fi_addr: 0xffffffffffffffff tag: 0xfbcc407000000000)
libfabric:217045:ofi_rxm:cq:rxm_match_rx_buf():764<debug> Enqueueing msg to unexpected msg queue

You can see in this debug snippet that a send does not match the pre-posted recv. In this case the tag is valid and a recv was posted with it, but because the send endpoint does not receive, I have not given it an address in the AV. The recv therefore sees fi_addr: 0xffffffffffffffff and does not match it. If the message came from the recv endpoint on that rank, the address would be correct and it would match (I surmise).
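
The behaviour described above can be modelled with a small sketch. This is a simplified stand-in, not the rxm provider's code: the struct, the ADDR_UNRESOLVED constant (chosen to equal the fi_addr value printed in the log) and recv_matches() are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the fi_addr shown in the log when the sending endpoint
 * has no entry in the receiver's address vector. */
#define ADDR_UNRESOLVED UINT64_MAX

/* A posted tagged receive in this toy model. */
struct posted_recv {
    bool     directed;  /* true: only accept messages from src_addr */
    uint64_t src_addr;  /* AV address the receive was posted against */
    uint64_t tag;       /* tag to match */
    uint64_t ignore;    /* tag bits to ignore while matching */
};

/* Return true if an incoming message satisfies the posted receive. */
static bool recv_matches(const struct posted_recv *r,
                         uint64_t msg_src, uint64_t msg_tag)
{
    assert(r);
    bool tag_ok = (r->tag | r->ignore) == (msg_tag | r->ignore);
    bool src_ok = !r->directed || r->src_addr == msg_src;
    return tag_ok && src_ok;
}
```

In this model, a directed receive posted against AV address 3 with tag 0xfbcc407000000000 rejects a message carrying the same tag but source ADDR_UNRESOLVED, and the same message is accepted as soon as the receive is no longer directed.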

If I remove the FI_DIRECTED_RECV flag, the mismatch goes away, but I'm left with messages being received into the wrong buffer, because different ranks reuse the same tags.

When I extend my tags to contain the rank info, it stops working, but that must be due to a bad tag bitmasking operation on my part, which I'm now looking at.

If there is any way of keeping FI_DIRECTED_RECV and making the address work, that would be great. But I suspect I'm out of luck ...

I will report back if I can fix the tag issue and the problem is solved.


From: Hefty, Sean <sean.hefty at intel.com>
Sent: 13 February 2021 02:50:09
To: Hefty, Sean; Biddiscombe, John A.; libfabric-users at lists.openfabrics.org
Subject: RE: Not receiving messages from other ranks

> I'm looking into this.  The checks in the tcp provider are too strict.  I have a patch
> that fixes that, but I doubt it will help here.  I'm still analyzing the rxm code to
> understand if it's passing the right capabilities to tcp and handling its checks
> correctly.

As an update, I haven't found anything in rxm that looks incorrect.  The changes I made to the other providers have been merged into master.  It's possible the fix to the tcp provider will eliminate the error you're seeing during enable.

- Sean