[ofiwg] Progress in util_cq/util_cntr?

Byrne, John (Labs) john.l.byrne at hpe.com
Mon Oct 7 08:34:15 PDT 2019


I see two issues:


1.)    The synchronous calls won't execute the progress function once the they decide to sleep, so FI_PROGRESS_MANUAL does not work. This is absolutely true in util_cq, but not completely so in util_cntr since it sleeps in 50 ms intervals, but the 50 ms would be an unacceptable latency in any real use. Using the cntr wait loop as an example:

do {
                cntr->progress(cntr);
                if (threshold <= ofi_atomic_get64(&cntr->cnt))
                        return FI_SUCCESS;

                if (errcnt != ofi_atomic_get64(&cntr->err))
                        return -FI_EAVAIL;

                if (ofi_adjust_timeout(endtime, &timeout))
                        return -FI_ETIMEDOUT;

                /*
                 * Temporary work-around to avoid a thread hanging in underlying
                 * epoll_wait called from fi_wait. This can happen if one thread
                 * updates the counter, another thread reads it (thereby resetting
                 * cntr signal fd) and the current thread is about to wait. The
                 * current thread would never wake up and doesn't know the counter
                 * has been updated. Fix it by checking counter state every now
                 * and then instead of waiting for a longer period. This does
                 * have the overhead of threads waking up unnecessarily.
                 */
                 */
                timeout_quantum = (timeout < 0 ? OFI_TIMEOUT_QUANTUM_MS :
                                   MIN(OFI_TIMEOUT_QUANTUM_MS, timeout));

                ret = fi_wait(&cntr->wait->wait_fid, timeout_quantum);
     } while (!ret || (ret == -FI_ETIMEDOUT &&
              (timeout < 0 || timeout_quantum < timeout)));

Setting the timeout_quantum to 0 in the case cntr->domain->data_progress == FI_PROGRESS_MANUAL seems to be a quick fix, but it might be wiser to allow a wait function to be specified as part of the util_cntr/cq and let it deal with this.


2.)    ofi_cq_read() only calls the progress function if the cq is empty or the count is zero.


        if (ofi_cirque_isempty(cq->cirq) || !count) {
                cq->cq_fastlock_release(&cq->cq_lock);
                cq->progress(cq);
                cq->cq_fastlock_acquire(&cq->cq_lock);
                if (ofi_cirque_isempty(cq->cirq)) {
                        i = -FI_EAGAIN;
                        goto out;
                }
        }

While I actually like idea of trying to reduce gratuitous progress calls,  this caused some code I had to break. I was doing a fi_recv(FI_PEEK) and getting ENOMSG returned immediately in the cq, because the queue was never empty, the progress function was never called and my loop hung indefinitely. I believe the progress function must be called unconditionally.

The util_cntr code does not have this issue, progress is called unconditionally in ofi_cntr_read().


John Byrne







-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/ofiwg/attachments/20191007/566f858a/attachment.html>


More information about the ofiwg mailing list