[libfabric-users] [chuck at ece.cmu.edu: utility provider breaks fi_wait()]

Hefty, Sean sean.hefty at intel.com
Thu Sep 17 09:10:52 PDT 2020


> hello?  is this libfabric list still alive?   i'm hoping for some feedback
> on how to get this fi_wait() issue resolved.

Yes - this list is still alive.

>     I've run into a problem with the fi_wait(waitset,timeout)
> call and the way that libfabric stacks the RXD utility provider
> on top of other providers.  On our system this issue causes
> fi_wait() to block for the full timeout value, even if network
> progress could be made.

From your comment below, you are running rxd over psm.

>     The issue is this:
> 
>       - providers like PSM, PSM2, and GNI have their own
>         version of the "wait_open" function in their fi_ops_fabric
>         structure.   (see: prov/psm/src/psmx_wait.c,
>         prov/psm2/src/psmx2_wait.c, or prov/gni/src/gnix_wait.c)
> 
>       - these "wait_open" functions create a wait object and
>         init the "wait" function pointer in their "fi_ops_wait"
>         struct to point to a provider specific wait function
>         (psmx_wait_wait(), psmx2_wait_wait(), gnix_wait_wait()).
> 
>         The intent here appears to be to route all wait operations
>         for a provider through this *_wait_wait() function to allow
>         it to do additional setup/teardown operations before/after
>         the main wait operation (the main wait operation being something
>         like a linux epoll_wait() system call via util_wait_fd_run()).
> 
> 
>       - the RXD utility provider sets its "wait_open" to point
>         to the generic ofi_wait_fd_open() function and does not
>         provide any "wait" functions for the "fi_ops_wait" structure
>         (so RXD waits go directly to the util_wait_fd_run() function).
> 
>       - RXD stores pointers to the provider it is layered on top of
>         in various RXD structures:
>            rxd_fabric->dg_fabric, rxd_domain->dg_domain, rxd_ep->dg_cq
> 
> 
>       - when rxd_ep_bind() is called in the FI_CLASS_CQ case,
>         it does this (see "rxd_ep.c" line 753):
> 
>             if (!ep->dg_cq) {
>                /* done by the rxd_dg_cq_open() helper function */
> 
>                 -> call fi_cq_open() on dg_domain, store cq in ep->dg_cq
> 
>                 -> call fi_control(FI_GETWAIT) on &rxd_ep->dg_cq->fid
>                    to get the dg_cq's wait file descriptor and save in
>                    rxd_ep->dg_cq_fd
> 
>                /* end of rxd_dg_cq_open() helper function */
>             }
> 
>             ofi_wait_add_fd(cq->wait /*RXD's CQ*/,
>                             ep->dg_cq_fd /* underlying dg_domain's CQ fd */,
>                             POLLIN, rxd_ep_trywait, ep,
>                             &ep->util_ep.ep_fid.fid);
> 
>         RXD seems to be extracting the epoll_wait() file descriptor
>         from the wait object of the provider it is layered on top of
>         (the dg_domain), and then adding the dg_domain's file descriptor
>         to RXD's wait object.
> 
>         I believe the thinking here is that adding the wait
>         file descriptor from the underlying layer's wait object
>         to RXD's wait object allows the RXD's wait object to handle
>         waits for both RXD and the provider it is layered on top
>         of just using RXD's wait object.   Unfortunately, this
>         is a bad assumption and it does not work.
> 
> 
> Here's the problem: when an application using RXD calls fi_wait()
> it goes directly to the util_wait_fd_run() function and blocks
> in epoll_wait() without ever calling the underlying layer's
> "wait" function from the "fi_ops_wait" structure (i.e. psmx_wait_wait(),
> psmx2_wait_wait(), gnix_wait_wait(), ... are never called!).

util_wait_fd_run() calls the underlying provider's wait_try() function.  The intent is that any work the core provider needs to do prior to waiting should be done there.
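
To illustrate the "try before you block" pattern described above, here is a minimal POSIX sketch.  It is not libfabric's code: every name in it (sketch_wait_entry, sketch_trywait_fn, sketch_wait_run, sketch_counter_trywait) is hypothetical, and the return-value convention (0 means "go check for completions", -ETIMEDOUT means the timeout expired) is an assumption made for the example.  Each registered fd carries an optional pre-wait hook; the wait routine calls every hook first and only blocks in poll() if none of them reports pending work.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>

#define SKETCH_MAX_FDS 16

/* Hypothetical pre-wait hook.  Returns 0 when it is safe to block,
 * -EAGAIN when work is already pending and the caller must not sleep,
 * or another negative errno value on failure. */
typedef int (*sketch_trywait_fn)(void *arg);

/* One fd registered with the wait object, plus its optional hook. */
struct sketch_wait_entry {
    int fd;                     /* e.g. the lower layer's CQ wait fd */
    sketch_trywait_fn trywait;  /* may be NULL */
    void *arg;                  /* passed to trywait */
};

/* Example hook (assumption): arg points to an int counting queued
 * completions that have not been reported to the application yet. */
int sketch_counter_trywait(void *arg)
{
    const int *pending = arg;

    return *pending > 0 ? -EAGAIN : 0;
}

/* Run every hook, then block in poll() only if all hooks allow it.
 * Returns 0 when the caller should check for completions,
 * -ETIMEDOUT when the timeout expired, or another negative errno. */
int sketch_wait_run(struct sketch_wait_entry *entries, size_t n,
                    int timeout_ms)
{
    struct pollfd pfds[SKETCH_MAX_FDS];
    size_t i;
    int ret;

    if (n > SKETCH_MAX_FDS)
        return -EINVAL;

    for (i = 0; i < n; i++) {
        if (entries[i].trywait) {
            ret = entries[i].trywait(entries[i].arg);
            if (ret == -EAGAIN)
                return 0;       /* work pending: do not sleep */
            if (ret)
                return ret;     /* hook failed */
        }
        pfds[i].fd = entries[i].fd;
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
    }

    ret = poll(pfds, (nfds_t)n, timeout_ms);
    if (ret < 0)
        return -errno;
    return ret > 0 ? 0 : -ETIMEDOUT;
}
```

In this sketch, a lower layer that needs to do work before sleeping supplies a hook instead of relying on its own wait function being called.  If the hook reports pending work, sketch_wait_run() returns immediately rather than sleeping for the full timeout.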

AFAIK, RxD has not been used over psm, psm2, or gni providers.  I don't believe those providers are set up to handle such layering, as they support RDM endpoints directly.

- Sean

