[libfabric-users] Not receiving messages from other ranks
Biddiscombe, John A.
john.biddiscombe at cscs.ch
Fri Feb 12 12:42:17 PST 2021
I'm looking into this. The checks in the tcp provider are too strict. I have a patch that fixes that, but I doubt it will help here. I'm still analyzing the rxm code to understand if it's passing the right capabilities to tcp and handling its checks correctly.
Thanks. Any time you want me to test a patch, I'm ready to go.
In case it helps: if I change the provider to sockets, the code runs and messages are now received, though the tests fail validation of the received data (wrong numbers are being received). I'm not sure whether this is a bug of mine. The exact same code using tcp;rxm does not receive messages.
I'm using FI_DIRECTED_RECV because some ranks send data to themselves and use the same tag as data coming from other ranks.
I'm using 64-bit tags, and I get results I can't explain if I put some info such as the rank in the tag using bitmasks: the sockets version then stops working. I have no idea why making the tags more different/unique would be an issue. Mismatched tags would, however, explain the data validation failures when the tests are run with sockets: if the wrong tag is being matched, we get errors, but if I change the tags, nothing matches. Should I only use 32-bit tags?
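For reference, here is a minimal sketch of the kind of tag packing described above. It is not the code from my tests: the layout (rank in the upper 32 bits, user tag in the lower 32 bits) and the names make_tag, tag_rank, tag_user and tag_matches are hypothetical. tag_matches only models the usual tag/ignore comparison, where bits set in ignore are excluded, and it does not call libfabric. One pitfall this layout exposes is shifting a 32-bit rank by 32 without first widening it to 64 bits, which is undefined behaviour in C. Note also that a provider may restrict which tag bits are usable (see mem_tag_format in the endpoint attributes), so a layout like this is an assumption that needs checking against the provider.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: [ rank : 32 bits | user tag : 32 bits ] */
#define RANK_SHIFT    32
#define USER_TAG_MASK UINT64_C(0x00000000FFFFFFFF)
#define RANK_MASK     UINT64_C(0xFFFFFFFF00000000)

/* Pack a rank and a user tag into one 64-bit tag.
 * The cast to uint64_t before the shift matters: shifting a
 * 32-bit value by 32 is undefined behaviour. */
static uint64_t make_tag(uint32_t rank, uint64_t user_tag)
{
    /* The user tag must fit in the low 32 bits of this layout. */
    assert((user_tag & ~USER_TAG_MASK) == 0);
    return ((uint64_t)rank << RANK_SHIFT) | user_tag;
}

/* Extract the rank field from a packed tag. */
static uint32_t tag_rank(uint64_t tag)
{
    return (uint32_t)(tag >> RANK_SHIFT);
}

/* Extract the user tag field from a packed tag. */
static uint64_t tag_user(uint64_t tag)
{
    return tag & USER_TAG_MASK;
}

/* Model of tag matching with an ignore mask: bits set in
 * 'ignore' are not compared, all other bits must be equal. */
static bool tag_matches(uint64_t recv_tag, uint64_t ignore, uint64_t msg_tag)
{
    return ((recv_tag ^ msg_tag) & ~ignore) == 0;
}
```

With this layout, posting a receive with ignore = RANK_MASK would match the user tag from any rank, while ignore = 0 requires both the rank and the user tag to be equal.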
I'm trying to test on gni to see what errors I might get there, but it fails at fi_getinfo, so I must have messed up my libfabric build and I'm rebuilding everything again.
I'm using the v1.11.2 tag from the git repo.