[libfabric-users] Queue size question
Biddiscombe, John A.
john.biddiscombe at cscs.ch
Wed Mar 24 00:53:11 PDT 2021
>Someone familiar with gni will need to chime in on how it maps to their HW.
Yes, you should see -FI_EAGAIN when trying to post more operations than the queues support. There are checks like this in some providers -- I think rxm, verbs, and tcp all do, and rxm is actually forgiving about it by allowing queues to overflow. (Because it's easy to swamp a receiver, even with a reasonably well-written app.)
>Resource management is the correct setting. Manually limiting your application to the tx/rx sizes, and sizing the CQ appropriately, should have done the trick.
>It sounds like this is a problem likely restricted to gni.
When I get an FI_EAGAIN return during send or recv, I back off, progress the network, and retry until the message goes through, so I do not believe I am overrunning the Tx or Rx CQs. However, the symptoms I am seeing are that I post N sends but only receive (N-M) send completions. On the peer rank, I receive (N-M) recv completions, and then my code hangs as both ranks poll forever: the sending rank knows that M sends have not completed, and the receiving rank knows that M receives are still waiting.
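To make the retry pattern concrete, here is a minimal sketch of the back-off/progress/retry loop described above. It is not my actual code: post_fn and progress_fn are hypothetical callbacks standing in for a wrapper around fi_send()/fi_recv() and a fi_cq_read() polling loop, and struct fake_ep is a toy queue model so the sketch builds without libfabric. The only libfabric fact assumed is that FI_EAGAIN has the same value as EAGAIN.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical callbacks (not libfabric API): post() would wrap
 * fi_send()/fi_recv(), progress() would wrap a fi_cq_read() loop. */
typedef int (*post_fn)(void *ctx);
typedef void (*progress_fn)(void *ctx);

/* Retry a post until it is accepted, fails with a real error, or
 * max_retries rounds of -EAGAIN have been seen. Returns the last
 * return code from post(); *retries receives the number of retries. */
static int post_with_retry(post_fn post, progress_fn progress, void *ctx,
                           size_t max_retries, size_t *retries)
{
    assert(post != NULL && progress != NULL);
    size_t n = 0;
    int rc;
    while ((rc = post(ctx)) == -EAGAIN && n < max_retries) {
        n++;
        progress(ctx);   /* drive the network so queue slots can free up */
    }
    if (retries != NULL)
        *retries = n;
    return rc;
}

/* Toy stand-in for an endpoint with a bounded queue (assumption,
 * for illustration only). */
struct fake_ep {
    int capacity;    /* maximum operations in flight */
    int in_flight;   /* operations currently posted */
};

/* Accept a post only if a queue slot is free. */
static int fake_post(void *ctx)
{
    struct fake_ep *ep = ctx;
    if (ep->in_flight >= ep->capacity)
        return -EAGAIN;
    ep->in_flight++;
    return 0;
}

/* Pretend one outstanding operation completed. */
static void fake_progress(void *ctx)
{
    struct fake_ep *ep = ctx;
    if (ep->in_flight > 0)
        ep->in_flight--;
}
```

Bounding the retries is a choice made for the sketch so that a stuck queue shows up as a returned -EAGAIN rather than an endless loop.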
The value of M changes randomly depending on the size of my test, but when I use larger messages it occurs more frequently than with small ones (contention for resources, no doubt). The probability of a hang increases with message size and with the number of threads doing the sending/receiving. I believe I have ruled out the possibility of there being a problem with my code (extensive testing and checks for races, etc.).
I'd like to find out what is going wrong, as I believe libfabric might be losing some of my sends. Since a send can complete on this rank without the receive on the other rank completing, the fact that I am missing send completions makes me think that something is being munged at the send end (as opposed to the Rx end).
Now if it turns out that I have T threads all sending and each is adhering to its allowed send CQ size, but the sum of all messages in flight at some moment is actually larger than the rx CQ size can handle, then I have a problem (is there any throttling mechanism I can use to make sure this never happens?) - but with 8 threads and 16 per thread = 128 messages, I still get lockups, even when the rx CQ size is 500 - so I don't believe that I'm breaking the rules. (There still exists the possibility that I have 128 messages in flight, but that is counting send completions, and the receiving end might not have actually handled them all. So although I think I have 128 in flight, I actually have 128 incomplete sends and an unspecified number of messages that have been sent but not handled at the other end.)
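The kind of throttle I have in mind could look like the sketch below: one credit counter shared by all sending threads, where a thread takes a credit before posting and returns it when the matching completion is seen. This is an application-level illustration using C11 atomics, not a libfabric feature, and send_credits and its functions are hypothetical names. Note that returning credits on send completion only bounds the sender's incomplete sends; bounding what the receiver has not yet handled would need the receiver to hand credits back explicitly.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Shared pool of send credits (hypothetical helper, not libfabric API). */
typedef struct {
    atomic_int available;   /* credits not currently held by a sender */
} send_credits;

/* Set the total number of operations allowed in flight across threads. */
static void credits_init(send_credits *c, int limit)
{
    atomic_init(&c->available, limit);
}

/* Try to take one credit. Returns false if none are left, in which
 * case the caller should progress the network and try again later. */
static bool credits_try_acquire(send_credits *c)
{
    int cur = atomic_load(&c->available);
    while (cur > 0) {
        /* On failure the CAS reloads cur with the current value. */
        if (atomic_compare_exchange_weak(&c->available, &cur, cur - 1))
            return true;
    }
    return false;
}

/* Return one credit, e.g. when a completion for the operation is read. */
static void credits_release(send_credits *c)
{
    atomic_fetch_add(&c->available, 1);
}
```

With a limit chosen below the smaller of the tx and rx queue sizes, the total in flight across all T threads cannot exceed that limit, regardless of how the work is split per thread.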
I welcome any clues on how to debug this. Turning on full logging might not be an option, since we're talking about a lot of messages and irregular hangs. Is it possible to turn on debug logging for just some parts of libfabric that might help me? I'm using a very recent checkout of the v1.12 branch.