[libfabric-users] utility provider breaks fi_wait()

Mon Sep 14 06:40:21 PDT 2020

hi-

    I've run into a problem with the fi_wait(waitset,timeout)
call and the way that libfabric stacks the RXD utility provider 
on top of other providers.  On our system this issue causes
fi_wait() to block for the full timeout value, even if network
progress could be made.

    The issue is this:

      - providers like PSM, PSM2, and GNI have their own
        version of the "wait_open" function in their fi_ops_fabric
        structure.   (see: prov/psm/src/psmx_wait.c, 
        prov/psm2/src/psmx2_wait.c, or prov/gni/src/gnix_wait.c)

      - these "wait_open" functions create a wait object and
        init the "wait" function pointer in their "fi_ops_wait"
        struct to point to a provider specific wait function
        (psmx_wait_wait(), psmx2_wait_wait(), gnix_wait_wait()).

        The intent here appears to be to route all wait operations
        for a provider through this *_wait_wait() function to allow
        it to do additional setup/teardown operations before/after 
        the main wait operation (the main wait operation being something 
        like a linux epoll_wait() system call via util_wait_fd_run()).

      - the RXD utility provider sets its "wait_open" to point
        to the generic ofi_wait_fd_open() function and does not
        provide any "wait" functions for the "fi_ops_wait" structure
        (so RXD waits go directly to the util_wait_fd_run() function).

      - RXD stores pointers to the provider it is layered on top of
        in various RXD structures:
           rxd_fabric->dg_fabric, rxd_domain->dg_domain, rxd_ep->dg_cq

      - when rxd_ep_bind() is called in the FI_CLASS_CQ case,
        it does this (see "rxd_ep.c" line 753):

            if (!ep->dg_cq) {
               /* done by the rxd_dg_cq_open() helper function */

                -> call fi_cq_open() on dg_domain, store cq in ep->dg_cq

                -> call fi_control(FI_GETWAIT) on &rxd_ep->dg_cq->fid
                   to get the dg_cq's wait file descriptor and save in
                   rxd_ep->dg_cq_fd

               /* end of rxd_dg_cq_open() helper function */
            }

            ofi_wait_add_fd(cq->wait /*RXD's CQ*/,
                            ep->dg_cq_fd /* underlying dg_domain's CQ fd */,
                            POLLIN, rxd_ep_trywait, ep,
                            &ep->util_ep.ep_fid.fid);

        RXD seems to be extracting the epoll_wait() file descriptor
        from the wait object of the provider it is layered on top of
        (the dg_domain), and then adding the dg_domain's file descriptor
        to RXD's wait object.

        I believe the thinking here is that adding the wait
        file descriptor from the underlying layer's wait object
        to RXD's wait object lets RXD's wait object handle
        waits for both RXD and the provider it is layered on top
        of.   Unfortunately, this assumption does not hold: the
        file descriptor alone does not carry the underlying
        provider's "wait" function with it.
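
To make the call paths above concrete, here is a minimal model of the
dispatch in plain C.  Everything named model_* is made up for
illustration and is not the libfabric API; model_fd_run() only plays
the role of util_wait_fd_run() and model_provider_wait() the role of
a provider function like psmx_wait_wait().  The point is just that
copying a file descriptor between wait objects does not copy the
"wait" function pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for fid_wait / fi_ops_wait; not the real libfabric types. */
struct model_wait;

struct model_ops_wait {
    int (*wait)(struct model_wait *w, int timeout);
};

struct model_wait {
    const struct model_ops_wait *ops;
    int fd;          /* descriptor a FI_GETWAIT-style query would return */
    int hook_calls;  /* times the provider-specific wait function ran */
};

/* Generic path, playing the role of util_wait_fd_run(). */
static int model_fd_run(struct model_wait *w, int timeout)
{
    (void)w;
    (void)timeout;
    return 0;
}

/* Provider path, playing the role of psmx_wait_wait():
 * extra provider work first, then the generic wait. */
static int model_provider_wait(struct model_wait *w, int timeout)
{
    w->hook_calls++;
    return model_fd_run(w, timeout);
}

static const struct model_ops_wait model_generic_ops = { model_fd_run };
static const struct model_ops_wait model_provider_ops = { model_provider_wait };

/* fi_wait()-style entry point: dispatch through the object's own ops table. */
static int model_wait_call(struct model_wait *w, int timeout)
{
    assert(w != NULL && w->ops != NULL && w->ops->wait != NULL);
    return w->ops->wait(w, timeout);
}

/* Returns how many times the lower provider's wait function ran.
 * layered != 0: the upper layer borrows the lower fd and waits on its own object.
 * layered == 0: the caller waits on the lower object directly. */
static int model_hook_calls(int layered)
{
    struct model_wait lower = { &model_provider_ops, 7, 0 };
    struct model_wait upper = { &model_generic_ops, -1, 0 };

    if (layered) {
        upper.fd = lower.fd;  /* only the descriptor crosses the layer */
        model_wait_call(&upper, 100);
    } else {
        model_wait_call(&lower, 100);
    }
    return lower.hook_calls;
}
```

In this model, model_hook_calls(0) reports one call to the provider's
wait function, while model_hook_calls(1) reports none.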

Here's the problem: when an application using RXD calls fi_wait()
it goes directly to the util_wait_fd_run() function and blocks
in epoll_wait() without ever calling the underlying layer's 
"wait" function from the "fi_ops_wait" structure (i.e. psmx_wait_wait(), 
psmx2_wait_wait(), gnix_wait_wait(), ... are never called!).

So the additional setup/teardown operations that those *_wait_wait()
functions do get skipped when using RXD.   For example, in the case of PSM,
psmx_wait_wait() creates a transient progress thread to drive progress 
while the main thread calling fi_wait() is blocked.  Without that thread,
no progress occurs and the application blocks until the fi_wait()
timeout fires.
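
The stall can be reproduced outside libfabric with a toy program.
This is only a sketch under stated assumptions: the toy_* names are
hypothetical, a pipe stands in for the provider's wait file
descriptor, and the provider-specific wait drives progress inline
rather than through a transient thread as psmx_wait_wait() does.
generic_fd_wait() plays the role of util_wait_fd_run(), using poll()
in place of epoll_wait().

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical toy provider: its wait fd only becomes readable
 * after someone drives progress by hand. */
struct toy_provider {
    int fds[2];   /* fds[0]: wait fd handed to upper layers, fds[1]: signal end */
    int pending;  /* completions that still need manual progress */
};

static int toy_open(struct toy_provider *p, int pending)
{
    assert(p != NULL);
    p->pending = pending;
    return pipe(p->fds);
}

static void toy_close(struct toy_provider *p)
{
    close(p->fds[0]);
    close(p->fds[1]);
}

/* Manual progress: each pending completion signals the wait fd. */
static void toy_progress(struct toy_provider *p)
{
    while (p->pending > 0) {
        char c = 1;
        if (write(p->fds[1], &c, 1) != 1)
            break;
        p->pending--;
    }
}

/* Generic fd wait: 1 if the fd is readable, 0 on timeout, -1 on error. */
static int generic_fd_wait(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int rc = poll(&pfd, 1, timeout_ms);

    return rc > 0 ? 1 : rc;
}

/* Provider-specific wait: drive progress, then do the generic wait. */
static int toy_provider_wait(struct toy_provider *p, int timeout_ms)
{
    toy_progress(p);
    return generic_fd_wait(p->fds[0], timeout_ms);
}

/* use_provider_wait != 0: wait through the provider's own wait function.
 * use_provider_wait == 0: the layered case, polling only the borrowed fd. */
static int toy_run(int use_provider_wait)
{
    struct toy_provider p;
    int rc;

    if (toy_open(&p, 1) != 0)
        return -1;
    rc = use_provider_wait ? toy_provider_wait(&p, 50)
                           : generic_fd_wait(p.fds[0], 50);
    toy_close(&p);
    return rc;
}
```

Here toy_run(1) returns 1 right away, while toy_run(0) sits in poll()
for the full 50 ms and returns 0 even though one completion is
pending, which mirrors what I see with fi_wait() on RXD.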

What is the best way to resolve this issue?

Note that I'm using FI_PROGRESS_MANUAL mode (required by RXD)
and the libfabric backend of the Mercury RPC library 
( https://mercury-hpc.github.io/ ) with PSM.

chuck