[libfabric-users] gni provider and FI_REMOTE_CQ_DATA

D'Alessandro, Luke K ldalessa at iu.edu
Thu Sep 3 16:07:11 PDT 2020


Hi All,

I have some test code that depends on FI_REMOTE_CQ_DATA, which I’ve debugged using the UDP;ofi_rxd provider.

I am trying to run that code using the gni provider on an XC40, but I never see the remote CQ events there. Is there something special I need to set up to get remote CQ events with gni?

I request:

fi_info *hints = fi_allocinfo();
hints->caps                   = FI_RMA | FI_REMOTE_WRITE | FI_RMA_EVENT;
hints->mode                   = FI_CONTEXT | FI_CONTEXT2;
hints->domain_attr->mr_mode   = FI_MR_BASIC;
hints->ep_attr->type          = FI_EP_RDM;
hints->tx_attr->msg_order     = FI_ORDER_WAW | FI_ORDER_RMA_WAW;
hints->rx_attr->msg_order     = FI_ORDER_WAW | FI_ORDER_RMA_WAW;
hints->rx_attr->caps          = FI_RMA | FI_REMOTE_WRITE | FI_RMA_EVENT;

And I successfully receive:

0: # Provider                           Fabric               Domain Version     EP_TYPE    Protocol
0: # gni                                   gni /sys/class/gni/kgni0     1.1   FI_EP_RDM   FI_EP_RDM
0: # gni;ofi_rxd                           gni /sys/class/gni/kgni0   111.0   FI_EP_RDM   FI_EP_RDM
...
1: # Provider                           Fabric               Domain Version     EP_TYPE    Protocol
1: # gni                                   gni /sys/class/gni/kgni0     1.1   FI_EP_RDM   FI_EP_RDM
1: # gni;ofi_rxd                           gni /sys/class/gni/kgni0   111.0   FI_EP_RDM   FI_EP_RDM

I move through a sequence of initialization calls that seems to be standard, as far as I can tell, resulting in an endpoint that is enabled successfully.

static fi_context ep_ctx[2];
check(fi_endpoint, domain, info, &ep, ep_ctx);
check(fi_ep_bind, ep, &tx->fid, FI_TRANSMIT | FI_SELECTIVE_COMPLETION);
check(fi_ep_bind, ep, &rx->fid, FI_RECV);
check(fi_ep_bind, ep, &av->fid, 0);
check(fi_enable, ep);

Messages are sent with fi_writemsg and FI_REMOTE_CQ_DATA, and the calls neither fail nor return -FI_EAGAIN. This is a little alarming, as I have a tx/rx size of 500 and I send more than that through the endpoint; I guess they just vanish into the ether.

int e = fi_writemsg(ep, &msg, FI_REMOTE_CQ_DATA);

if (likely(!e)) {
  return true;
}

if (likely(e == -FI_EAGAIN)) {
  return false;
}

fmt::print(stderr, "[{}] has unhandled tx error {}: {}\n", mpi::rank(), -e, fi_strerror(-e));
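
For readers who want to try the three-way handling above without a fabric, here is a minimal, self-contained sketch of the same pattern. It is not the code from my test: `Post`, `classify`, and `post_with_retry` are hypothetical names, `EAGAIN` from <cerrno> stands in for FI_EAGAIN (an assumption; libfabric's fi_errno.h maps one to the other on the platforms I know of), and the `progress` callback is a placeholder for whatever drives progress in a real program, such as polling the completion queue.

```cpp
#include <cassert>
#include <cerrno>

// Outcome of one attempt to post an operation (hypothetical type).
enum class Post { Done, Retry, Failed };

// Classify a libfabric-style return value: 0 on success, a negated errno
// value on failure. EAGAIN is used here as a stand-in for FI_EAGAIN.
constexpr Post classify(int rc) {
  if (rc == 0) return Post::Done;         // operation was queued
  if (rc == -EAGAIN) return Post::Retry;  // back-pressure, try again later
  return Post::Failed;                    // anything else is a hard error
}

// Call `post` until it stops reporting back-pressure or `max_attempts` is
// exhausted. `progress` runs between attempts; in a real program it would
// be the place to poll the completion queue.
template <class PostFn, class ProgressFn>
Post post_with_retry(PostFn post, ProgressFn progress, int max_attempts) {
  assert(max_attempts > 0);
  for (int i = 0; i < max_attempts; ++i) {
    const Post r = classify(post());
    if (r != Post::Retry) return r;
    progress();
  }
  return Post::Retry;  // still blocked after max_attempts tries
}
```

In the real code, `post` would wrap the fi_writemsg call shown above. The puzzling part on gni is that the `Retry` branch is never taken, even after more writes than the queue size.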

Unfortunately, I never see any completions on the target rank (unlike with UDP;ofi_rxd, where things are fine).

Is there some magic that I need with gni to make FI_REMOTE_CQ_DATA work?

Thanks,
Luke