[libfabric-users] [External] gni provider and FI_REMOTE_CQ_DATA

D'Alessandro, Luke K ldalessa at iu.edu
Fri Sep 4 16:22:15 PDT 2020

So I just discovered in the log:

libfabric:31689:gni:ep_data:gnix_ops_allowed():887<debug> [31689:2] flags:0x2220204, FI_REMOTE_CQ_DATA, FI_FENCE, FI_INJECT
libfabric:31689:gni:ep_data:gnix_ops_allowed():889<debug> [31689:2] peer_caps:0x118000000312004, FI_MULTI_RECV, FI_TRIGGER, FI_FENCE
libfabric:31689:gni:ep_data:gnix_ops_allowed():891<debug> [31689:2] caps:0x118000000312004, FI_RMA, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_FENCE, FI_LOCAL_COMM, FI_REMOTE_COMM, FI_RMA_EVENT
libfabric:31689:gni:cq:_gnix_cq_add_error():325<info> [31689:2] creating error event entry
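
To make the hex masks in the log easier to compare, here is a minimal sketch that decodes them by bit position. The bit values are an assumption taken from my reading of libfabric's rdma/fabric.h and should be checked against the installed header; they do reproduce the names the log prints for caps. decode_bits and has_bit are hypothetical helpers, not libfabric functions.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Bit positions assumed from rdma/fabric.h; verify against your headers.
// Only the bits that occur in the two masks from the log are listed.
struct FlagName {
    std::uint64_t bit;
    const char*   name;
};

constexpr FlagName kFlagNames[] = {
    {1ULL << 2,  "FI_RMA"},
    {1ULL << 9,  "FI_WRITE"},
    {1ULL << 13, "FI_REMOTE_WRITE"},
    {1ULL << 16, "FI_MULTI_RECV"},
    {1ULL << 17, "FI_REMOTE_CQ_DATA"},
    {1ULL << 20, "FI_TRIGGER"},
    {1ULL << 21, "FI_FENCE"},
    {1ULL << 25, "FI_INJECT"},
    {1ULL << 51, "FI_LOCAL_COMM"},
    {1ULL << 52, "FI_REMOTE_COMM"},
    {1ULL << 56, "FI_RMA_EVENT"},
};

// Hypothetical helper: names of all known bits set in mask, lowest bit first.
std::vector<std::string> decode_bits(std::uint64_t mask) {
    std::vector<std::string> names;
    for (const auto& f : kFlagNames) {
        if (mask & f.bit) {
            names.emplace_back(f.name);
        }
    }
    return names;
}

// Hypothetical helper: true if the given bit is set in mask.
bool has_bit(std::uint64_t mask, std::uint64_t bit) {
    return (mask & bit) != 0;
}
```

With these values, decode_bits(0x2220204) includes FI_REMOTE_CQ_DATA (bit 17), while has_bit(0x118000000312004, 1ULL << 17) is false, so that bit is present in flags but not in caps. Whether this mismatch is what gnix_ops_allowed() rejects is not something the sketch establishes.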

And some hunting in a debug build shows me that I’m failing at https://github.com/ofiwg/libfabric/blob/master/prov/gni/src/gnix_rma.c#L1224.

I guess that I haven’t set up the endpoint/cq appropriately, so I’ll keep poking at that to see where I have gone wrong.


On Sep 3, 2020, at 4:07 PM, D'Alessandro, Luke K <ldalessa at iu.edu> wrote:

Hi All,

I have some test code that depends on FI_REMOTE_CQ_DATA, which I’ve debugged using the UDP;ofi_rxd provider.

I am trying to run that code using the gni provider on an XC40, but I never see the remote CQ events there. Is there something special I need to set up to get remote CQ events with gni?

I request:

fi_info *hints = fi_allocinfo();
hints->caps                   = FI_RMA | FI_REMOTE_WRITE | FI_RMA_EVENT;
hints->mode                   = FI_CONTEXT | FI_CONTEXT2;
hints->domain_attr->mr_mode   = FI_MR_BASIC;
hints->ep_attr->type          = FI_EP_RDM;
hints->tx_attr->msg_order     = FI_ORDER_WAW | FI_ORDER_RMA_WAW;
hints->rx_attr->msg_order     = FI_ORDER_WAW | FI_ORDER_RMA_WAW;
hints->rx_attr->caps          = FI_RMA | FI_REMOTE_WRITE | FI_RMA_EVENT;

And I successfully receive:

0: # Provider                           Fabric               Domain Version     EP_TYPE    Protocol
0: # gni                                   gni /sys/class/gni/kgni0     1.1   FI_EP_RDM   FI_EP_RDM
0: # gni;ofi_rxd                           gni /sys/class/gni/kgni0   111.0   FI_EP_RDM   FI_EP_RDM
1: # Provider                           Fabric               Domain Version     EP_TYPE    Protocol
1: # gni                                   gni /sys/class/gni/kgni0     1.1   FI_EP_RDM   FI_EP_RDM
1: # gni;ofi_rxd                           gni /sys/class/gni/kgni0   111.0   FI_EP_RDM   FI_EP_RDM

I move through a sequence of initialization calls that seems standard as far as I can tell, resulting in an endpoint that is enabled successfully.

static fi_context ep_ctx[2];
check(fi_endpoint, domain, info, &ep, ep_ctx);
check(fi_ep_bind, ep, &tx->fid, FI_TRANSMIT | FI_SELECTIVE_COMPLETION);
check(fi_ep_bind, ep, &rx->fid, FI_RECV);
check(fi_ep_bind, ep, &av->fid, 0);
check(fi_enable, ep);

Messages are sent with fi_writemsg and FI_REMOTE_CQ_DATA, and the calls neither fail nor return -FI_EAGAIN. This is a little alarming: I have a tx/rx size of 500 and I send more than that through the endpoint, so I guess the messages just vanish into the ether.

int e = fi_writemsg(ep, &msg, FI_REMOTE_CQ_DATA);

if (likely(!e)) {
  return true;   // write was posted
}

if (likely(e == -FI_EAGAIN)) {
  return false;  // resources temporarily unavailable; nothing was posted
}

fmt::print(stderr, "[{}] has unhandled tx error {}: {}\n", mpi::rank(), -e, fi_strerror(-e));
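
For context, the caller side of the snippet above could be sketched as a bounded retry loop that lets the provider make progress whenever the post reports -FI_EAGAIN. This is only an illustration under stated assumptions: post and progress are hypothetical callables standing in for the fi_writemsg call and for draining the transmit CQ, and kAgain stands in for FI_EAGAIN, which I believe rdma/fi_errno.h defines as EAGAIN.

```cpp
#include <cerrno>
#include <functional>

// Outcome of a bounded sequence of posting attempts.
enum class PostResult { Sent, GaveUp, Failed };

// Stand-in for FI_EAGAIN (assumed equal to EAGAIN; check rdma/fi_errno.h).
constexpr int kAgain = EAGAIN;

// post():     hypothetical wrapper returning what fi_writemsg returned
// progress(): hypothetical wrapper that reads the transmit CQ
PostResult post_with_retry(const std::function<long()>& post,
                           const std::function<void()>& progress,
                           int max_attempts) {
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        const long rc = post();
        if (rc == 0) {
            return PostResult::Sent;    // operation was accepted
        }
        if (rc != -kAgain) {
            return PostResult::Failed;  // any other error is not retried
        }
        progress();                     // give the provider a chance to free resources
    }
    return PostResult::GaveUp;          // still busy after max_attempts tries
}
```

Bounding the number of attempts keeps a stalled target from turning into a silent hang on the sender, which makes the "no error, no completion" case easier to notice.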

Unfortunately I never see any completions on the target rank (unlike UDP;ofi_rxd where things are fine).

Is there some magic that I need with gni to make FI_REMOTE_CQ_DATA work?
