[libfabric-users] Not receiving messages from other ranks

Biddiscombe, John A. john.biddiscombe at cscs.ch
Fri Feb 12 12:42:17 PST 2021



> I'm looking into this. The checks in the tcp provider are too strict. I have a patch that fixes that, but I doubt it will help here. I'm still analyzing the rxm code to understand if it's passing the right capabilities to tcp and handling its checks correctly.

Thanks. Any time you want me to test a patch, I'm ready to go.

In case it helps: if I change the provider to sockets, the code runs and messages are now received, though the tests fail when validating the received data (wrong numbers are being received). I'm not sure whether this is a bug of mine or not. The exact same code using tcp;rxm does not receive messages.

I'm using FI_DIRECTED_RECEIVE because some ranks send data to themselves and use the same tag as data coming from other ranks.
I'm using 64bit tags and I get weird (= I can't explain them well) results if I put some info like the rank in the tag using bitmasks: the sockets version then stops working, and I have no idea why making the tags more different/unique would be an issue. Mismatched tags would explain the data failures when the tests are run with sockets: if the wrong tag is being matched, we get errors, but if I change the tags, nothing matches. Should I only use 32bit tags?
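
To make the tag question concrete, here is a minimal sketch of the kind of bitmask packing I mean. The layout (rank in the upper 32 bits, user tag in the lower 32) and all of the names below are hypothetical and only for illustration; this is not my actual code. tag_matches() is my understanding of the matching rule described in fi_tagged(3), where bits set in the ignore mask passed to fi_trecv are excluded from the comparison, so treat it as an assumption and not as the provider's implementation. One thing the sketch does show is that the rank has to be widened to 64 bits before the shift, since shifting a 32bit value by 32 is undefined in C. How many tag bits a provider really supports is something I would still need to check against mem_tag_format in the endpoint attributes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: upper 32 bits = sender rank, lower 32 bits = user tag. */
#define RANK_SHIFT    32
#define USER_TAG_MASK UINT64_C(0x00000000FFFFFFFF)
#define RANK_MASK     UINT64_C(0xFFFFFFFF00000000)

/* Pack rank and user tag into one 64-bit tag. */
static inline uint64_t make_tag(uint32_t rank, uint32_t user_tag)
{
    /* Widen before shifting: (rank << 32) on a 32-bit type is undefined. */
    return ((uint64_t)rank << RANK_SHIFT) | (uint64_t)user_tag;
}

/* Extract the rank field from a packed tag. */
static inline uint32_t tag_rank(uint64_t tag)
{
    return (uint32_t)(tag >> RANK_SHIFT);
}

/* Extract the user tag field from a packed tag. */
static inline uint32_t tag_user(uint64_t tag)
{
    return (uint32_t)(tag & USER_TAG_MASK);
}

/*
 * Assumed model of tag matching: bits set in `ignore` do not take part
 * in the comparison between the posted receive tag and the incoming tag.
 */
static inline bool tag_matches(uint64_t recv_tag, uint64_t ignore,
                               uint64_t send_tag)
{
    return (recv_tag | ignore) == (send_tag | ignore);
}

/* Usage example: a receive that ignores the rank field accepts any sender. */
static inline void tag_layout_selfcheck(void)
{
    uint64_t sent = make_tag(3, 7);

    assert(tag_rank(sent) == 3 && tag_user(sent) == 7);
    assert(tag_matches(make_tag(0, 7), RANK_MASK, sent)); /* rank ignored  */
    assert(!tag_matches(make_tag(0, 7), 0, sent));        /* rank compared */
}
```

With a layout like this, a receive posted with ignore = 0 only matches when the rank bits agree as well, which is one way "more unique" tags could end up matching nothing if the sender and receiver disagree about what goes in the rank field.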

I'm trying to test on gni to see what errors I might get there, but it fails at fi_getinfo, so I must have messed up my libfabric build and I'm rebuilding everything again.

I'm using the v1.11.2 tag from the git repo.

JB
