[libfabric-users] [External] gni provider and FI_REMOTE_CQ_DATA
D'Alessandro, Luke K
ldalessa at iu.edu
Fri Sep 4 16:22:15 PDT 2020
So I just discovered in the log:
libfabric:31689:gni:ep_data:gnix_ops_allowed():887<debug> [31689:2] flags:0x2220204, FI_REMOTE_CQ_DATA, FI_FENCE, FI_INJECT
libfabric:31689:gni:ep_data:gnix_ops_allowed():889<debug> [31689:2] peer_caps:0x118000000312004, FI_MULTI_RECV, FI_TRIGGER, FI_FENCE
libfabric:31689:gni:ep_data:gnix_ops_allowed():891<debug> [31689:2] caps:0x118000000312004, FI_RMA, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_TRIGGER, FI_FENCE, FI_LOCAL_COMM, FI_REMOTE_COMM, FI_RMA_EVENT
libfabric:31689:gni:cq:_gnix_cq_add_error():325<info> [31689:2] creating error event entry
And some hunting in a debug build shows me that I’m failing at https://github.com/ofiwg/libfabric/blob/master/prov/gni/src/gnix_rma.c#L1224.
I guess that I haven’t set up the endpoint/cq appropriately, so I’ll keep poking at that to see where I have gone wrong.
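For reading the hex masks in those debug lines, here is a minimal sketch that decodes them by hand. The bit positions are written from memory of rdma/fabric.h, so treat them as assumptions and check them against the header for the libfabric version in use; decode_bits and kBits are hypothetical names, not libfabric API. Going by the numbers alone, peer_caps and caps are both 0x118000000312004, so the shorter name list on the peer_caps line appears to be a difference in what was printed rather than in the bits.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Assumed bit positions (from memory of rdma/fabric.h; verify before relying on them).
struct NamedBit { std::uint64_t bit; const char* name; };
static const NamedBit kBits[] = {
    {1ULL << 2,  "FI_RMA"},
    {1ULL << 9,  "FI_WRITE"},
    {1ULL << 13, "FI_REMOTE_WRITE"},
    {1ULL << 16, "FI_MULTI_RECV"},
    {1ULL << 17, "FI_REMOTE_CQ_DATA"},
    {1ULL << 20, "FI_TRIGGER"},
    {1ULL << 21, "FI_FENCE"},
    {1ULL << 25, "FI_INJECT"},
    {1ULL << 51, "FI_REMOTE_COMM"},
    {1ULL << 52, "FI_LOCAL_COMM"},
    {1ULL << 56, "FI_RMA_EVENT"},
};

// Hypothetical helper: list the known names set in mask, lowest bit first,
// and append any bits not in the table as a single hex value.
std::string decode_bits(std::uint64_t mask) {
    std::string out;
    for (const NamedBit& nb : kBits) {
        if (mask & nb.bit) {
            if (!out.empty()) out += ", ";
            out += nb.name;
            mask &= ~nb.bit;  // clear the bit so leftovers can be reported
        }
    }
    if (mask != 0) {
        std::ostringstream os;
        os << "0x" << std::hex << mask;
        if (!out.empty()) out += ", ";
        out += os.str();
    }
    return out;
}
```

Under those assumed bit positions, decode_bits(0x2220204) also reports FI_RMA and FI_WRITE alongside the three flag names shown in the flags line.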
Thanks,
Luke
On Sep 3, 2020, at 4:07 PM, D'Alessandro, Luke K <ldalessa at iu.edu> wrote:
Hi All,
I have some test code that depends on FI_REMOTE_CQ_DATA which I’ve debugged using the UDP;ofi_rxd provider.
I am trying to run that code using the gni provider on an XC40, but I never see the remote CQ events there. Is there something special I need to set up to get remote CQ events with gni?
I request:
fi_info *hints = fi_allocinfo();
hints->caps = FI_RMA | FI_REMOTE_WRITE | FI_RMA_EVENT;
hints->mode = FI_CONTEXT | FI_CONTEXT2;
hints->domain_attr->mr_mode = FI_MR_BASIC;
hints->ep_attr->type = FI_EP_RDM;
hints->tx_attr->msg_order = FI_ORDER_WAW | FI_ORDER_RMA_WAW;
hints->rx_attr->msg_order = FI_ORDER_WAW | FI_ORDER_RMA_WAW;
hints->rx_attr->caps = FI_RMA | FI_REMOTE_WRITE | FI_RMA_EVENT;
And I successfully receive:
0: # Provider Fabric Domain Version EP_TYPE Protocol
0: # gni gni /sys/class/gni/kgni0 1.1 FI_EP_RDM FI_EP_RDM
0: # gni;ofi_rxd gni /sys/class/gni/kgni0 111.0 FI_EP_RDM FI_EP_RDM
...
1: # Provider Fabric Domain Version EP_TYPE Protocol
1: # gni gni /sys/class/gni/kgni0 1.1 FI_EP_RDM FI_EP_RDM
1: # gni;ofi_rxd gni /sys/class/gni/kgni0 111.0 FI_EP_RDM FI_EP_RDM
I move through a sequence of initialization calls that seem to be standard from what I can tell, resulting in an endpoint that is enabled successfully.
static fi_context ep_ctx[2];
check(fi_endpoint, domain, info, &ep, ep_ctx);
check(fi_ep_bind, ep, &tx->fid, FI_TRANSMIT | FI_SELECTIVE_COMPLETION);
check(fi_ep_bind, ep, &rx->fid, FI_RECV);
check(fi_ep_bind, ep, &av->fid, 0);
check(fi_enable, ep);
Messages are sent with fi_writemsg and FI_REMOTE_CQ_DATA, and the calls neither fail nor return -FI_EAGAIN. This is a little alarming, as I have a tx/rx size of 500 and I send more than that through the endpoint, so I guess they just vanish into the ether.
int e = fi_writemsg(ep, &msg, FI_REMOTE_CQ_DATA);
if (likely(!e)) {
return true;
}
if (likely(e == -FI_EAGAIN)) {
return false;
}
fmt::print(stderr, "[{}] has unhandled tx error {}: {}\n", mpi::rank(), -e, fi_strerror(-e));
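The true/false return in the snippet above implies a caller that retries after a "try again" result. A minimal sketch of that caller follows, assuming that FI_EAGAIN has the same value as the system EAGAIN and that some progress step (for example, reading the transmit CQ) runs between attempts. post_with_retry is a hypothetical helper, not part of libfabric, and it takes callables so that it stands alone without the library.

```cpp
#include <cerrno>
#include <functional>

// Hypothetical helper: call post() until it returns something other than
// -EAGAIN, running progress() after each "try again" result.
// Returns post()'s final result, or -EAGAIN if max_tries attempts all
// asked for a retry.
int post_with_retry(const std::function<int()>& post,
                    const std::function<void()>& progress,
                    int max_tries) {
    for (int attempt = 0; attempt < max_tries; ++attempt) {
        int rc = post();
        if (rc != -EAGAIN) {
            return rc;  // success (0) or a hard error
        }
        progress();     // give the provider a chance to free resources
    }
    return -EAGAIN;
}
```

In the code above, post would wrap the fi_writemsg call and progress would drain the CQ bound with FI_TRANSMIT. Bounding the attempts keeps a stalled endpoint from spinning forever.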
Unfortunately I never see any completions on the target rank (unlike UDP;ofi_rxd where things are fine).
Is there some magic that I need with gni to make FI_REMOTE_CQ_DATA work?
Thanks,
Luke