[libfabric-users] utility provider breaks fi_wait()
Chuck Cranor
chuck at ece.cmu.edu
Mon Sep 14 06:40:21 PDT 2020
hi-
I've run into a problem with the fi_wait(waitset,timeout)
call and the way that libfabric stacks the RXD utility provider
on top of other providers. On our system this issue causes
fi_wait() to block for the full timeout value, even if network
progress could be made.
The issue is this:
- providers like PSM, PSM2, and GNI have their own
version of the "wait_open" function in their fi_ops_fabric
structure. (see: prov/psm/src/psmx_wait.c,
prov/psm2/src/psmx2_wait.c, or prov/gni/src/gnix_wait.c)
- these "wait_open" functions create a wait object and
init the "wait" function pointer in their "fi_ops_wait"
struct to point to a provider specific wait function
(psmx_wait_wait(), psmx2_wait_wait(), gnix_wait_wait()).
The intent here appears to be to route all wait operations
for a provider through this *_wait_wait() function to allow
it to do additional setup/teardown operations before/after
the main wait operation (the main wait operation being something
like a Linux epoll_wait() system call via util_wait_fd_run()).
- the RXD utility provider sets its "wait_open" to point
to the generic ofi_wait_fd_open() function and does not
provide any "wait" functions for the "fi_ops_wait" structure
(so RXD waits go directly to the util_wait_fd_run() function).
- RXD stores pointers to the provider it is layered on top of in
various RXD structures:
rxd_fabric->dg_fabric, rxd_domain->dg_domain, rxd_ep->dg_cq
- when rxd_ep_bind() is called in the FI_CLASS_CQ case,
it does this (see "rxd_ep.c" line 753):
    if (!ep->dg_cq) {
        /* done by the rxd_dg_cq_open() helper function */
        -> call fi_cq_open() on dg_domain, store cq in ep->dg_cq
        -> call fi_control(FI_GETWAIT) on &rxd_ep->dg_cq->fid
           to get the dg_cq's wait file descriptor and save in
           rxd_ep->dg_cq_fd
        /* end of rxd_dg_cq_open() helper function */
    }
    ofi_wait_add_fd(cq->wait /*RXD's CQ*/,
                    ep->dg_cq_fd /* underlying dg_domain's CQ fd */,
                    POLLIN, rxd_ep_trywait, ep,
                    &ep->util_ep.ep_fid.fid);
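
To make the dispatch pattern in the list above concrete, here is a minimal
sketch of a "wait_open" that installs either a generic wait function or a
provider-specific wrapper in an ops table. None of these names are libfabric
symbols: struct wait_ops, struct wait_obj, generic_wait(), prov_wait_wait(),
and the two *_wait_open() helpers are hypothetical stand-ins, and the
counters exist only so the two paths can be told apart.

```c
#include <assert.h>
#include <stddef.h>

struct wait_obj;

/* Hypothetical, simplified analogue of an ops table with a "wait" pointer. */
struct wait_ops {
    int (*wait)(struct wait_obj *w, int timeout_ms);
};

struct wait_obj {
    struct wait_ops ops;
    int setup_calls;    /* times the provider-specific wrapper ran */
    int generic_calls;  /* times the generic wait ran */
};

/* Stands in for a generic fd-based wait; does not actually block here. */
int generic_wait(struct wait_obj *w, int timeout_ms)
{
    (void)timeout_ms;
    w->generic_calls++;
    return 0;
}

/* Provider-specific wrapper: extra setup, then the generic wait. */
int prov_wait_wait(struct wait_obj *w, int timeout_ms)
{
    w->setup_calls++;                       /* provider setup step */
    int ret = generic_wait(w, timeout_ms);  /* the main wait operation */
    /* provider teardown would go here */
    return ret;
}

/* A provider's "wait_open" installs its own wrapper in the ops table. */
void prov_wait_open(struct wait_obj *w)
{
    w->setup_calls = 0;
    w->generic_calls = 0;
    w->ops.wait = prov_wait_wait;
}

/* A generic "wait_open" points straight at the generic wait. */
void generic_wait_open(struct wait_obj *w)
{
    w->setup_calls = 0;
    w->generic_calls = 0;
    w->ops.wait = generic_wait;
}

/* Callers only ever go through the ops table. */
int do_wait(struct wait_obj *w, int timeout_ms)
{
    return w->ops.wait(w, timeout_ms);
}
```

A wait object opened with prov_wait_open() runs the setup step on every
do_wait(); one opened with generic_wait_open() never does, which is the
difference that matters for the rest of this report.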
RXD seems to be extracting the epoll_wait() file descriptor
from the wait object of the provider it is layered on top of
(the dg_domain), and then adding the dg_domain's file descriptor
to RXD's wait object.
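
The fd-nesting mechanism itself can be sketched with plain Linux epoll,
independent of libfabric. In the sketch below an "inner" epoll set watches a
pipe and an "outer" epoll set watches the inner epoll fd, roughly the way
the dg_cq's wait fd is added to RXD's wait object. The function name
nested_epoll_ready() and the use of a pipe as the event source are
assumptions made for illustration only.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper: nest an inner epoll set inside an outer one and
 * return what epoll_wait() on the outer set reports (1 = ready,
 * 0 = timed out, -1 = setup error). If make_event is nonzero, one byte
 * is written to the pipe watched by the inner set before waiting.
 */
int nested_epoll_ready(int make_event)
{
    int pipefd[2] = { -1, -1 };
    int inner = -1, outer = -1, n = -1;
    struct epoll_event ev = { 0 }, out;

    if (pipe(pipefd) != 0)
        return -1;
    inner = epoll_create1(0);
    outer = epoll_create1(0);
    if (inner < 0 || outer < 0)
        goto done;

    /* inner set watches the read end of the pipe */
    ev.events = EPOLLIN;
    ev.data.fd = pipefd[0];
    if (epoll_ctl(inner, EPOLL_CTL_ADD, pipefd[0], &ev) != 0)
        goto done;

    /* outer set watches the inner epoll fd */
    ev.events = EPOLLIN;
    ev.data.fd = inner;
    if (epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev) != 0)
        goto done;

    /* optionally generate an event on the innermost fd */
    if (make_event && write(pipefd[1], "x", 1) != 1)
        goto done;

    /* wait on the outer set only, with a 100 ms timeout */
    n = epoll_wait(outer, &out, 1, 100);
done:
    if (outer >= 0)
        close(outer);
    if (inner >= 0)
        close(inner);
    close(pipefd[0]);
    close(pipefd[1]);
    return n;
}
```

The outer wait wakes up only when something has already made the inner fd
readable. If nothing does, the outer wait simply runs out its timeout.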
I believe the thinking here is that adding the wait
file descriptor from the underlying layer's wait object
to RXD's wait object allows the RXD's wait object to handle
waits for both RXD and the provider it is layered on top
of just using RXD's wait object. Unfortunately, this
is a bad assumption and it does not work.
Here's the problem: when an application using RXD calls fi_wait()
it goes directly to the util_wait_fd_run() function and blocks
in epoll_wait() without ever calling the underlying layer's
"wait" function from the "fi_ops_wait" structure (i.e. psmx_wait_wait(),
psmx2_wait_wait(), gnix_wait_wait(), ... are never called!).
So the additional setup/teardown operations that those *_wait_wait()
functions perform are skipped when using RXD. For example, in the case of PSM,
psmx_wait_wait() creates a transient progress thread to drive progress
while the main thread calling fi_wait() is blocked. Without that thread,
no progress occurs and the application blocks until the fi_wait()
timeout fires.
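
The failure mode can be reproduced in a small model with no libfabric code
at all. Everything below is hypothetical: struct lower_prov stands in for a
manual-progress provider, lower_wait_wait() plays the role of a
provider-specific wait hook, generic_fd_wait() plays the role of
util_wait_fd_run(), and layered_wait() is the path taken when a layered
provider borrows the fd and skips the hook. One deliberate simplification:
the hook here drives progress synchronously before blocking, rather than
from a transient thread.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical stand-in for a manual-progress provider. */
struct lower_prov {
    int wait_fd[2];  /* pipe: [0] is the fd exposed for waiting */
    int pending;     /* completions not yet surfaced on wait_fd */
    int hook_calls;  /* how often the provider wait hook ran */
};

int lower_init(struct lower_prov *p)
{
    p->pending = 0;
    p->hook_calls = 0;
    return pipe(p->wait_fd);
}

void lower_fini(struct lower_prov *p)
{
    close(p->wait_fd[0]);
    close(p->wait_fd[1]);
}

/* Progress: surface pending completions by making wait_fd readable. */
void lower_progress(struct lower_prov *p)
{
    while (p->pending > 0 && write(p->wait_fd[1], "c", 1) == 1)
        p->pending--;
}

/* Generic fd wait: returns 1 if the fd became readable, 0 on timeout. */
int generic_fd_wait(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    return poll(&pfd, 1, timeout_ms) > 0;
}

/* Provider wait hook: drive progress, then do the generic wait. */
int lower_wait_wait(struct lower_prov *p, int timeout_ms)
{
    p->hook_calls++;
    lower_progress(p);
    return generic_fd_wait(p->wait_fd[0], timeout_ms);
}

/* Layered path: waits on the borrowed fd, never calls the hook. */
int layered_wait(struct lower_prov *p, int timeout_ms)
{
    return generic_fd_wait(p->wait_fd[0], timeout_ms);
}
```

With one completion pending, layered_wait() blocks for the whole timeout
and returns 0, while lower_wait_wait() returns 1 immediately because the
hook made progress first.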
What is the best way to resolve this issue?
Note that I'm using FI_PROGRESS_MANUAL mode (required by RXD)
and the libfabric backend of the Mercury RPC library
( https://mercury-hpc.github.io/ ) with PSM.
chuck