[libfabric-users] Queue size question

Wed Mar 24 05:40:35 PDT 2021

Hi John,

Could you point me to your test and instructions on how to reproduce this
problem?  We might get lucky and I'll be able to fix it.

Howard

On Wed, 24 Mar 2021 at 01:53, Biddiscombe, John A. <
john.biddiscombe at cscs.ch> wrote:

> >Someone familiar with gni will need to chime in on how it maps to their
> HW.
>
> >
> Yes, you should see -FI_EAGAIN when trying to post more operations than
> the queues support.  There are checks like this in some providers -- I
> think rxm, verbs, and tcp all do, and rxm is actually forgiving about it by
> allowing queues to overflow.  (Because it's easy to swamp a receiver, even
> with a reasonably well-written app.)
> <
>
> >Resource management is the correct setting.  Manually limiting your
> application to the tx/rx sizes, and sizing the CQ appropriately, should have
> done the trick.
>
> >It sounds like this is a problem likely restricted to gni.
>
> When I get an FI_EAGAIN message during send or recv, I back off,
> progress the network, and retry until the message goes through, so I do not
> believe I am overrunning the Tx or Rx CQs. However, the symptoms I am
> seeing are that I post N sends but only receive (N-M) send completions. On
> the peer rank, I receive (N-M) recv completions, and then my code hangs as
> both ranks poll forever: the sending rank knows that there are M sends
> that have not completed, and the receiving rank knows that there are M
> receives waiting.
> The value of M changes randomly depending on the size of my test, but
> when I use larger messages it occurs more frequently than with small ones
> (contention for resources no doubt). The probability of a hang increases
> with message size and number of threads doing the sending/receiving. I
> believe I have ruled out the possibility of there being a problem with my
> code (extensive testing and checks for races etc).
>
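
For anyone following the thread, the post/back-off/progress/retry pattern
described above can be sketched roughly as below. This is only a minimal,
self-contained model and not the poster's code: mock_send() and
mock_cq_read() are hypothetical stand-ins for fi_send() and fi_cq_read(),
MOCK_EAGAIN stands in for -FI_EAGAIN, and TX_DEPTH is an assumed queue
depth. The posted/completed counters are the N and (N-M) from the report.

```c
#include <assert.h>

#define MOCK_EAGAIN (-11)   /* stand-in for -FI_EAGAIN (assumption) */
#define TX_DEPTH 4          /* hypothetical transmit queue depth */

/* Mock endpoint: counts operations posted but not yet completed. */
struct mock_ep { int in_flight; };

/* Stand-in for fi_send(): refuses new work once the queue is full. */
static int mock_send(struct mock_ep *ep)
{
    if (ep->in_flight >= TX_DEPTH)
        return MOCK_EAGAIN;
    ep->in_flight++;
    return 0;
}

/* Stand-in for fi_cq_read(): reaps at most one completion per call. */
static int mock_cq_read(struct mock_ep *ep)
{
    if (ep->in_flight == 0)
        return 0;
    ep->in_flight--;
    return 1;
}

struct counters { long posted; long completed; long retries; };

/* Post n sends, progressing the CQ whenever the queue pushes back,
 * then drain until every posted send has a matching completion. */
static struct counters send_all(struct mock_ep *ep, long n)
{
    struct counters c = {0, 0, 0};

    for (long i = 0; i < n; i++) {
        while (mock_send(ep) == MOCK_EAGAIN) {
            c.retries++;
            c.completed += mock_cq_read(ep);   /* back off and progress */
        }
        c.posted++;
    }
    /* This loop is where the reported hang would show up: if a
     * completion were lost, completed would never reach posted. */
    while (c.completed < c.posted)
        c.completed += mock_cq_read(ep);

    assert(ep->in_flight == 0);
    return c;
}
```

In a real test, bounding the drain loop (for example with a spin limit) and
printing posted/completed per thread when the limit is hit would give the
value of M directly instead of a silent hang.
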
> I'd like to find out what is going wrong, as I believe libfabric might be
> losing some of my sends. Since a send can complete on this rank without
> the receive on the other rank completing, the fact that I am missing send
> completions makes me think that something is being munged at the send end
> (as opposed to the Rx end).
>
> Now if it turns out that I have T threads all sending and each is adhering
> to its allowed send CQ size, but the sum of all messages in flight at some
> moment is actually larger than the rx CQ size can handle, then I have a
> problem (is there any throttling mechanism I can use to make sure this
> never happens?). However, with 8 threads and 16 messages per
> thread = 128 messages, I still get lockups even when the rx CQ size is 500,
> so I don't believe that I'm breaking the rules. (There still exists the
> possibility that I have 128 messages in flight, but that is counting send
> completions and the receiving end might not have actually handled them all,
> so although I think I have 128 in flight, I actually have 128 incomplete
> sends, and an unspecified number of messages that have been sent but not
> handled at the other end.)
>
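
On the throttling question, one application-level option (not a libfabric
feature, just a sketch) is a single credit counter shared by all sending
threads, so that the sum of in-flight sends can never exceed a chosen
limit such as 8 * 16 = 128. The names send_credits, credits_try_acquire
and credits_release below are made up for illustration; the code uses
only C11 atomics.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared budget of in-flight sends across all threads. */
struct send_credits { atomic_int available; };

static void credits_init(struct send_credits *c, int limit)
{
    assert(limit > 0);
    atomic_init(&c->available, limit);
}

/* Try to take one credit before posting a send; never blocks. */
static bool credits_try_acquire(struct send_credits *c)
{
    int cur = atomic_load(&c->available);

    while (cur > 0) {
        /* On failure, cur is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&c->available, &cur, cur - 1))
            return true;
    }
    return false;   /* budget exhausted: progress the CQ, retry later */
}

/* Return one credit once the corresponding message is accounted for. */
static void credits_release(struct send_credits *c)
{
    atomic_fetch_add(&c->available, 1);
}
```

Where the credit is released matters. Releasing on the local send
completion only bounds incomplete sends on the sender, which is the
caveat raised in the parenthesis above. To bound what the receiver has
not yet handled, the credit would have to come back on an explicit
acknowledgement from the peer.
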
> I welcome any clues on how to debug this. Turning on logging might not be
> an option, since we're talking about a lot of messages and irregular hangs,
> unless it is possible to turn on debug logging for just the parts of
> libfabric that might help me. I'm using the v1.12 branch from very recently.
>
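
On selective logging: libfabric reads the FI_LOG_LEVEL, FI_LOG_PROV and
FI_LOG_SUBSYS environment variables (see fi_fabric(7)), which should allow
output to be limited to one provider and a few subsystems. A hedged sketch
of setting them from inside a test program follows. restrict_fi_logging()
is a hypothetical helper, the values in the usage note are only examples,
and this assumes the variables are set before the first libfabric call is
made, since the library is expected to read them during initialization.

```c
#define _POSIX_C_SOURCE 200112L   /* for setenv() */
#include <assert.h>
#include <stdlib.h>

/* Narrow libfabric logging via its environment variables.
 * Returns 0 on success, -1 if any variable could not be set. */
static int restrict_fi_logging(const char *level, const char *prov,
                               const char *subsys)
{
    assert(level && prov && subsys);

    if (setenv("FI_LOG_LEVEL", level, 1) != 0)
        return -1;
    if (setenv("FI_LOG_PROV", prov, 1) != 0)
        return -1;
    if (setenv("FI_LOG_SUBSYS", subsys, 1) != 0)
        return -1;
    return 0;
}
```

For example, restrict_fi_logging("warn", "gni", "cq,ep_data") would ask for
warnings from the gni provider's completion-queue and data-transfer paths
only. The same effect can be had by exporting the three variables in the
job script. Note that the most verbose level may need a debug build of the
library.
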
> Thanks
>
> JB
> _______________________________________________
> Libfabric-users mailing list
> Libfabric-users at lists.openfabrics.org
> https://lists.openfabrics.org/mailman/listinfo/libfabric-users
>