[ofiwg] Progress in util_cq/util_cntr?
Byrne, John (Labs)
john.l.byrne at hpe.com
Mon Oct 7 08:34:15 PDT 2019
I see two issues:
1.) The synchronous calls won't execute the progress function once the they decide to sleep, so FI_PROGRESS_MANUAL does not work. This is absolutely true in util_cq, but not completely so in util_cntr since it sleeps in 50 ms intervals, but the 50 ms would be an unacceptable latency in any real use. Using the cntr wait loop as an example:
do {
cntr->progress(cntr);
if (threshold <= ofi_atomic_get64(&cntr->cnt))
return FI_SUCCESS;
if (errcnt != ofi_atomic_get64(&cntr->err))
return -FI_EAVAIL;
if (ofi_adjust_timeout(endtime, &timeout))
return -FI_ETIMEDOUT;
/*
* Temporary work-around to avoid a thread hanging in underlying
* epoll_wait called from fi_wait. This can happen if one thread
* updates the counter, another thread reads it (thereby resetting
* cntr signal fd) and the current thread is about to wait. The
* current thread would never wake up and doesn't know the counter
* has been updated. Fix it by checking counter state every now
* and then instead of waiting for a longer period. This does
* have the overhead of threads waking up unnecessarily.
*/
*/
timeout_quantum = (timeout < 0 ? OFI_TIMEOUT_QUANTUM_MS :
MIN(OFI_TIMEOUT_QUANTUM_MS, timeout));
ret = fi_wait(&cntr->wait->wait_fid, timeout_quantum);
} while (!ret || (ret == -FI_ETIMEDOUT &&
(timeout < 0 || timeout_quantum < timeout)));
Setting the timeout_quantum to 0 in the case cntr->domain->data_progress == FI_PROGRESS_MANUAL seems to be a quick fix, but it might be wiser to allow a wait function to be specified as part of the util_cntr/cq and let it deal with this.
2.) ofi_cq_read() only calls the progress function if the cq is empty or the count is zero.
if (ofi_cirque_isempty(cq->cirq) || !count) {
cq->cq_fastlock_release(&cq->cq_lock);
cq->progress(cq);
cq->cq_fastlock_acquire(&cq->cq_lock);
if (ofi_cirque_isempty(cq->cirq)) {
i = -FI_EAGAIN;
goto out;
}
}
While I actually like idea of trying to reduce gratuitous progress calls, this caused some code I had to break. I was doing a fi_recv(FI_PEEK) and getting ENOMSG returned immediately in the cq, because the queue was never empty, the progress function was never called and my loop hung indefinitely. I believe the progress function must be called unconditionally.
The util_cntr code does not have this issue, progress is called unconditionally in ofi_cntr_read().
John Byrne
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/ofiwg/attachments/20191007/566f858a/attachment.html>
More information about the ofiwg
mailing list