[ofiwg] fi_[ec]q_readerr

Tue Oct 21 13:31:08 PDT 2014

The more I think about this whole *_readerr() API, the more it bothers me.  In general, CQs and EQs simply hold the delayed results of issuing asynchronous commands or operations.  What's the reason for differentiating "successful" results from "failed" results in such a dramatic way?

The man page says "EQs are optimized to report operations which have completed successfully."  This may be true, but I don't see why it necessitates a separate call for retrieving an error completion.  Jeff points out that this semantic of first calling fi_cq_read() and getting -FI_EAVAIL, then calling fi_cq_readerr() to get the error can be cumbersome in a multi-threaded environment.
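To make the two-call pattern concrete, here is a rough mock of it.  Nothing below is the real libfabric API: the mock_* names, the struct layouts, and the error-code values are all made up for illustration, and the only behavior taken from the discussion above is "read returns -FI_EAVAIL, then a second call fetches the error".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the real libfabric names or values. */
#define MOCK_EAVAIL 259
#define MOCK_EAGAIN 11

struct mock_cq_entry     { void *op_context; };
struct mock_cq_err_entry { void *op_context; int err; };

/* Successful and failed completions live in separate queues. */
struct mock_cq {
    struct mock_cq_entry     ok[4];
    size_t                   ok_head, ok_tail;
    struct mock_cq_err_entry bad[4];
    size_t                   bad_head, bad_tail;
};

/* Returns 1 and fills *buf for a successful completion, -MOCK_EAVAIL if
 * an error completion is pending, -MOCK_EAGAIN if nothing is queued. */
int mock_cq_read(struct mock_cq *cq, struct mock_cq_entry *buf)
{
    assert(cq != NULL && buf != NULL);
    if (cq->bad_head != cq->bad_tail)
        return -MOCK_EAVAIL;
    if (cq->ok_head == cq->ok_tail)
        return -MOCK_EAGAIN;
    *buf = cq->ok[cq->ok_head++];
    return 1;
}

/* Second, separate call that actually dequeues the error completion. */
int mock_cq_readerr(struct mock_cq *cq, struct mock_cq_err_entry *buf)
{
    assert(cq != NULL && buf != NULL);
    if (cq->bad_head == cq->bad_tail)
        return -MOCK_EAGAIN;
    *buf = cq->bad[cq->bad_head++];
    return 1;
}

/* Caller-side pattern: read, and on -MOCK_EAVAIL make the second call. */
int poll_once(struct mock_cq *cq, int *err_out)
{
    struct mock_cq_entry entry;
    struct mock_cq_err_entry err;
    int ret = mock_cq_read(cq, &entry);

    *err_out = 0;
    if (ret == -MOCK_EAVAIL) {
        /* Another thread can run between the two calls and take the
         * error entry first, so the second call may come back empty. */
        if (mock_cq_readerr(cq, &err) != 1)
            return -MOCK_EAGAIN;
        *err_out = err.err;
    }
    return ret;
}
```

The gap between mock_cq_read() and mock_cq_readerr() is where the multi-threaded awkwardness comes from: the thread that saw -MOCK_EAVAIL is not guaranteed to be the one that gets the error entry unless the caller adds its own locking.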

It appears that part of the motivation for this is to keep the size of the buffer the user passes to [ec]q_read as small as possible.  Suppose that we add a pointer to an error structure in the completion struct, so that in the error case, the provider malloc()s space for the additional error data and returns a pointer to it.  This should satisfy the goal of keeping the struct the user passes in small, but also allow the provider to return rich error information, all through the single [ec]q_read API.

[ec]q_read would still return -FI_EAVAIL (or equivalent) for an error completion, but this would mean "look in your completion for the error information" rather than "make another call".

This also allows order to be maintained between "successful" and "failed" completions, which is lost with the out-of-band error reporting.
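As a rough sketch of the idea (all of the sk_* names, the struct fields, and the error-code values are hypothetical and not the actual libfabric definitions; having the caller free() the error data is also just an assumption for this sketch):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; not the real libfabric names or values. */
#define SK_EAVAIL 259
#define SK_EAGAIN 11
#define SK_CQ_DEPTH 8

/* Rich error data, allocated by the provider only when an op fails. */
struct sk_err_info {
    int err;
    int prov_errno;
};

/* The completion stays small: one extra pointer, NULL on success. */
struct sk_cq_entry {
    void               *op_context;
    struct sk_err_info *err;
};

/* One queue holds both successful and failed completions, in order. */
struct sk_cq {
    struct sk_cq_entry q[SK_CQ_DEPTH];
    size_t             head, tail;
};

/* Provider side: queue a completion; err == 0 means success. */
int sk_cq_post(struct sk_cq *cq, void *ctx, int err)
{
    struct sk_err_info *info = NULL;

    assert(cq != NULL);
    if (cq->tail == SK_CQ_DEPTH)
        return -1;                      /* queue full */
    if (err != 0) {
        info = malloc(sizeof(*info));
        if (info == NULL)
            return -1;
        info->err = err;
        info->prov_errno = 0;
    }
    cq->q[cq->tail].op_context = ctx;
    cq->q[cq->tail].err = info;
    cq->tail++;
    return 0;
}

/* User side: a single call.  Returns 1 for a successful completion,
 * -SK_EAVAIL for a failed one (details are in buf->err, which the
 * caller frees in this sketch), -SK_EAGAIN if the queue is empty. */
int sk_cq_read(struct sk_cq *cq, struct sk_cq_entry *buf)
{
    assert(cq != NULL && buf != NULL);
    if (cq->head == cq->tail)
        return -SK_EAGAIN;
    *buf = cq->q[cq->head++];
    return buf->err != NULL ? -SK_EAVAIL : 1;
}
```

If three operations complete as success, failure, success, then three sk_cq_read() calls return 1, -SK_EAVAIL, 1 in that order, and the failed completion carries its own error pointer, so there is no second call for another thread to race on.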

Thoughts?

-reese
