[libfabric-users] Queue size question

Biddiscombe, John A. john.biddiscombe at cscs.ch
Mon Mar 22 14:52:32 PDT 2021

Dear List,

I have a test that seems to run fine on tcp;ofi_rxm. It is only two ranks on the same laptop, so it isn't a very rigorous test, but I can throw anything at it and it reliably completes.

On GNI, I get lockups and, after much head scratching, I am wondering what the significance of the tx/rx attribute size may be.

On tcp;ofi_rxm the size reports as "size: 65536" and I can have 16 threads, one endpoint per thread, each with up to 128 messages in flight, plus a single receive endpoint handling all receives. That is potentially 16*128 = 2048 messages with posted receives.

When I run on GNI, using two nodes, each reports tx/rx attr "size: 500", and I find that when many messages are in flight, things can lock up because some posted sends are never received. This seems to happen even when I drop down to 16 threads with 8 messages in flight each, which ought to be 16*8 = 128 at a time, and I would have expected a size of 500 (cq size limitation?) to handle this.

Question 1 - what is the tx/rx attr size really telling me?

Question 2 - if I post more than the allowed receives or sends, should I not receive some kind of error? (I have enabled resource management, so I might expect a retry code when I attempt the send/recv)
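To make the expectation in Question 2 concrete, here is a minimal sketch of the retry behaviour I have in mind. It is not libfabric code: `MY_EAGAIN`, `post_fn`, `progress_fn` and the toy queue are hypothetical stand-ins for a "try again" return code (libfabric's is -FI_EAGAIN), a call such as fi_send/fi_recv, and a completion queue poll. Whether the GNI provider actually reports a full queue this way is exactly what I am asking.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a "queue full, try again" return code. */
#define MY_EAGAIN (-11)

typedef int  (*post_fn)(void *ctx);     /* stand-in for posting a send/recv */
typedef void (*progress_fn)(void *ctx); /* stand-in for polling completions */

/* Try to post; on MY_EAGAIN, drive progress and try again.
 * Returns 0 on success, MY_EAGAIN once max_retries is exhausted,
 * or any other error code returned by post unchanged. */
static int post_with_retry(post_fn post, progress_fn progress,
                           void *ctx, int max_retries)
{
    assert(post != NULL && progress != NULL);
    for (int attempt = 0; attempt <= max_retries; ++attempt) {
        int rc = post(ctx);
        if (rc != MY_EAGAIN)
            return rc;
        progress(ctx); /* free up queue slots before the next attempt */
    }
    return MY_EAGAIN;
}

/* Toy queue used only to exercise the retry loop. */
struct toy_queue {
    int capacity;
    int in_use;
};

static int toy_post(void *ctx)
{
    struct toy_queue *q = (struct toy_queue *)ctx;
    if (q->in_use >= q->capacity)
        return MY_EAGAIN;
    q->in_use++;
    return 0;
}

static void toy_progress(void *ctx)
{
    struct toy_queue *q = (struct toy_queue *)ctx;
    if (q->in_use > 0)
        q->in_use--; /* pretend one operation completed */
}
```

With a toy queue of capacity 2, a third post succeeds only if at least one retry is allowed, because the first attempt hits the full queue and the progress call frees a slot.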

Ideally, I'd like to throttle the number of messages in flight according to the capabilities the hardware reports. Which fields of fi_info should I use to do this?
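The kind of throttling I mean could look like the sketch below. It assumes, and this is the part I am unsure about, that the reported tx size bounds outstanding sends per endpoint and that the reported rx size bounds posted receives on the single shared receive endpoint. The names `per_sender_limit`, `struct throttle`, `throttle_try_acquire` and `throttle_release` are hypothetical helpers, not libfabric API.

```c
#include <assert.h>
#include <stddef.h>

/* Per-sender in-flight limit: no more than the sender asked for, no more
 * than its own tx depth, and no more than an equal share of the rx depth
 * of the one endpoint that receives from all senders. */
static size_t per_sender_limit(size_t tx_size, size_t rx_size,
                               size_t num_senders, size_t wanted)
{
    assert(num_senders > 0);
    size_t share = rx_size / num_senders;
    size_t limit = wanted;
    if (tx_size < limit)
        limit = tx_size;
    if (share < limit)
        limit = share;
    return limit;
}

/* One throttle per sending thread, so no atomics are needed here. */
struct throttle {
    size_t limit;
    size_t in_flight;
};

/* Returns 1 and takes a slot if the sender may post, 0 otherwise. */
static int throttle_try_acquire(struct throttle *t)
{
    if (t->in_flight >= t->limit)
        return 0;
    t->in_flight++;
    return 1;
}

/* Call once per completion to give the slot back. */
static void throttle_release(struct throttle *t)
{
    assert(t->in_flight > 0);
    t->in_flight--;
}
```

Under these assumptions, the GNI numbers above (size 500, 16 senders, 128 wanted) give a limit of 31 messages per thread, while the tcp;ofi_rxm numbers (size 65536) leave the requested 128 unchanged.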
