[libfabric-users] GNI Provider manual control_progress
Howard Pritchard
hppritcha at gmail.com
Thu May 25 07:55:20 PDT 2017
Hi Daniel,

Would you be able to write a small reproducer program for the problem you
are observing and open an issue on ofiwg/libfabric? Also, does your app
work with the sockets provider and FI_PROGRESS_MANUAL? The sockets provider
should work on your Cray system.

I think you are being a bit overly strict about the meaning of manual
progress. It's not that no data transfers can proceed until the app calls
fi_cq_read and friends, but rather that the app must make such calls to
guarantee forward progress of the transactions it has previously posted, as
well as of data transfer operations that other processes have initiated to
EPs it has set up.

In the case of systems like Cray XC Aries, where the underlying network can
progress RMA operations without software intervention (once they are posted
to the hw queues), it makes sense to kick the GNI provider's internal
progress engine to keep submitting RMA requests to the queues in the event
there's a backlog of requests. The hw queues can only take a certain number
of requests, hence the possibility of a backlog.
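To make the backlog argument concrete, here is a minimal toy model in C. It
is only a sketch and not the GNI provider's actual implementation: the
names toy_ep, toy_post, toy_progress, toy_hw_tick and toy_cq_read, as well
as the queue depth of 4, are hypothetical and chosen for illustration.
toy_cq_read stands in for an application call such as fi_cq_read under
manual progress.

```c
#include <assert.h>
#include <stddef.h>

#define HW_QUEUE_DEPTH 4 /* hypothetical capacity of the hardware queue */

/* Toy endpoint state; an assumption for illustration, not a libfabric type. */
struct toy_ep {
    size_t backlog;   /* requests posted by the app, not yet in the hw queue */
    size_t in_hw;     /* requests currently sitting in the hw queue */
    size_t completed; /* finished requests not yet reaped by the app */
};

/* Provider progress engine: move backlog into free hw queue slots. */
void toy_progress(struct toy_ep *ep)
{
    while (ep->backlog > 0 && ep->in_hw < HW_QUEUE_DEPTH) {
        ep->backlog--;
        ep->in_hw++;
    }
    assert(ep->in_hw <= HW_QUEUE_DEPTH);
}

/* App posts n requests; whatever does not fit in the hw queue is backlogged. */
void toy_post(struct toy_ep *ep, size_t n)
{
    ep->backlog += n;
    toy_progress(ep);
}

/* Hardware completes everything in its queue without software help.
 * It never touches the backlog. */
void toy_hw_tick(struct toy_ep *ep)
{
    ep->completed += ep->in_hw;
    ep->in_hw = 0;
}

/* App-side poll: reap completions and kick the progress engine.
 * Returns the number of completions reaped. */
size_t toy_cq_read(struct toy_ep *ep)
{
    size_t reaped = ep->completed;
    ep->completed = 0;
    toy_progress(ep);
    return reaped;
}
```

In this model, posting 10 requests with a queue depth of 4 puts 4 requests
into the hw queue and leaves 6 in the backlog. The hardware finishes the
first 4 on its own, but the remaining 6 stay where they are, no matter how
often toy_hw_tick runs, until the app calls toy_cq_read and thereby kicks
the progress engine.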
Howard
2017-05-25 4:15 GMT-06:00 Daniel Deppisch <danieldeppisch at onlinehome.de>:
> Hello Howard,
>
> thank you for your reply!
>
> I have a small benchmarking app which gets stuck with
> control_progress=manual + data_progress=manual.
> It does however work perfectly with control_progress=auto +
> data_progress=manual.
> So I figured there would be control_progress missing.
>
> If memory registration is synchronous, my guess would be that maybe
> AV-inserts are not progressed? Are those calls synchronous too? If they
> aren't, how do I get them to progress?
> Or is there any other progress that might be missing?
>
> The app gets stuck at the first attempt to wait for a transfer operation
> on a CQ. AV-insert calls do return.
>
> On your question:
> I don't really understand what is going on yet, so I can't really
> recommend anything. All I can say is that for me things are not working the
> way the manual suggests.
> The manual tells me control_progress=manual is supported, but all ways to
> actually make that progress (at least the ones I can deduce from the
> manual) are not supported/implemented, as they would involve the EQ.
> I probably wouldn't implement unnecessary "EQ action" but rather update
> the manual for the GNI provider to explain this.
>
> On your PR:
> "it causes all data progress to be delayed until the app starts calling
> fi_cq_read, etc."
> From what I read in the manual, this is exactly what I would expect from
> data_progress=manual.
>
> Another note:
> Since I mentioned updating the manual, I just want to add that in my
> opinion it would be nice to have the manual explain that for the GNI
> provider, setting wait_objects to FI_WAIT_NONE is perfectly fine, even if
> you want to use synchronous calls on queues and counters. This is rather
> unexpected behaviour, as the general documentation (on CQ for example)
> states that you actually can't use blocking calls with FI_WAIT_NONE. I only
> found this info somewhat by accident somewhere in the depths of the GNI
> provider fork.
>
>
> Thanks and regards,
> Daniel
>
>
> ------ Original Message ------
> From: "Howard Pritchard" <hppritcha at gmail.com>
> To: "Daniel Deppisch" <danieldeppisch at onlinehome.de>
> Cc: libfabric-users at lists.openfabrics.org
> Sent: 24.05.2017 17:16:47
> Subject: Re: [libfabric-users] GNI Provider manual control_progress
>
> Hello Daniel,
>
> For the GNI provider, memory registration is currently a synchronous
> call. There is no need to poll
> on an EQ to progress memory registration. From your reading of the man
> pages, do you think
> that it would be less confusing if the GNI provider added support for
> using EQ events to progress
> memory registration even when it's not necessary? It would not be a big
> problem to add such
> support.
>
> For data transfer operations however, you will indeed need to poll on the
> send/recv counters/CQs
> that have been bound to the EP initiating these data transfer operations.
>
> Just a heads up that we have at least one PR that needs to get upstream
> that may impact your testing:
>
> https://github.com/ofi-cray/libfabric-cray/pull/1347
>
> We should have that pushed upstream this week.
>
> Howard
>
>
> 2017-05-18 12:19 GMT-06:00 Daniel Deppisch <danieldeppisch at onlinehome.de>:
>
>> Hello,
>>
>> I'm a master's student from Germany currently evaluating libfabric
>> performance on a Cray Aries system.
>>
>> I am struggling to get control_progress=FI_PROGRESS_MANUAL working.
>> Documentation clearly states it is supported for the GNI provider.
>>
>> However, manual progress is described to be made when reading/waiting for
>> the queue/counter where the completion will be reported.
>> For control operations this should be the event queue.
>>
>> Now the GNI provider does not implement any of the functions for binding
>> an EQ.
>> Neither can I bind control operation completions (like av_insert()s) to a
>> CQ/counter using the cq_bind() flags.
>>
>> If manual control progress is supported, but the EQ cannot be used, how
>> do I make manual control progress?
>>
>> Thanks and regards,
>> Daniel Deppisch