[libfabric-users] Queue size question
Biddiscombe, John A.
john.biddiscombe at cscs.ch
Mon Mar 22 14:52:32 PDT 2021
I have a test that seems to run fine on tcp;ofi_rxm. It only uses two ranks on the same laptop, so it isn't a very demanding test, but I can throw anything at it and it reliably completes.
On GNI, I get lockups, and after much head scratching I am wondering what the significance of the tx/rx attribute size is.
On tcp;ofi_rxm the size is reported as "size: 65536", and I can have 16 threads, each with its own endpoint and up to 128 messages in flight, plus a single receive endpoint handling all receives. That is potentially 16*128 = 2048 messages with posted receives.
When I run on GNI using two nodes, each reports a tx/rx attr of "size: 500", and I find that when many messages are in flight things can lock up because some posted sends are never received. This happens even when I drop down to 16 threads with 8 messages in flight each, which ought to be at most 16*8 = 128 at a time, and I would have expected a size of 500 (a CQ size limitation?) to handle this.
Question 1 - what is the tx/rx attr size really telling me?
Question 2 - if I post more receives or sends than allowed, should I not receive some kind of error? (I have enabled resource management, so I might expect a retry code when I attempt the send/recv.)
Ideally, I'd like to throttle the number of messages in flight according to the capabilities the hardware reports. Which fields of fi_info should I use to do this?
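To make the throttling idea concrete, here is a minimal sketch of the kind of credit counter I have in mind. It rests on an assumption, which is exactly what I am unsure about: that the size reported in the tx/rx attributes can be used as a cap on outstanding operations per endpoint. The names op_throttle, throttle_init, throttle_try_acquire and throttle_release are hypothetical helpers of mine, not libfabric API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of libfabric): a credit counter that caps
 * the number of operations in flight on one endpoint. */
typedef struct {
    atomic_size_t in_flight; /* operations posted but not yet completed */
    size_t limit;            /* ASSUMPTION: taken from the tx/rx attr size */
} op_throttle;

/* reported_size would be e.g. 500 on GNI or 65536 on tcp;ofi_rxm. */
static void throttle_init(op_throttle *t, size_t reported_size)
{
    /* Never allow a limit of zero, or nothing could ever be posted. */
    t->limit = reported_size ? reported_size : 1;
    atomic_init(&t->in_flight, 0);
}

/* Try to reserve one slot before posting a send or receive.
 * Returns false when the cap is reached; the caller should then poll
 * its completion queue and retry instead of posting. */
static bool throttle_try_acquire(op_throttle *t)
{
    size_t cur = atomic_load(&t->in_flight);
    while (cur < t->limit) {
        /* On failure, cur is reloaded with the current value and rechecked. */
        if (atomic_compare_exchange_weak(&t->in_flight, &cur, cur + 1))
            return true;
    }
    return false;
}

/* Give one slot back for each completion read from the completion queue. */
static void throttle_release(op_throttle *t)
{
    size_t prev = atomic_fetch_sub(&t->in_flight, 1);
    assert(prev > 0); /* a release without a matching acquire is a bug */
    (void)prev;
}
```

In the real test the limit would come from info->tx_attr->size (or info->rx_attr->size) of the fi_info returned by fi_getinfo, with throttle_try_acquire called before each fi_send/fi_recv and throttle_release called once per completion. Whether that field is the right one to use is the open question.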