[libfabric-users] Not receiving messages from other ranks
Biddiscombe, John A.
john.biddiscombe at cscs.ch
Fri Feb 12 05:09:09 PST 2021
FYI: If I use FI_RECV | FI_TRANSMIT, then the endpoint enables. (I still don't receive all messages, but that seems to be a different problem.)
From: Libfabric-users <libfabric-users-bounces at lists.openfabrics.org> on behalf of Biddiscombe, John A. <john.biddiscombe at cscs.ch>
Sent: 12 February 2021 13:05:51
To: Hefty, Sean; libfabric-users at lists.openfabrics.org
Subject: Re: [libfabric-users] Not receiving messages from other ranks
After more debugging (and switching to tcp;ofi_rxm, since sockets doesn't seem to work), I am left with the following:
// create a completion queue for tx
fabric_info_->tx_attr->op_flags |= FI_COMPLETION;
txcq_ = create_completion_queue(fabric_domain_, fabric_info_->tx_attr->size);
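// ifdef branch (macro defined): set up a separate endpoint for sending messages
ep_tx_ = new_endpoint_active(fabric_domain_, fabric_info_, nullptr);
bind_queue_to_endpoint(ep_tx_, txcq_, FI_TRANSMIT);
// else branch (macro not defined): bind the tx CQ to the existing rx endpoint
bind_queue_to_endpoint(ep_rx_, txcq_, FI_TRANSMIT);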
When the ifdef is not defined, I bind the txcq_ to the rx endpoint and ask for TRANSMIT completions, and everything works fine.
When the ifdef is defined, fi_enable gives:
ERROR fi_enable : Missing or unavailable completion queue
and I just don't know what's wrong. I'm creating a cq, binding it along with the AV, but it gives me the error.
Any ideas why this might happen? Hopefully by the time the USA wakes up, I'll have found the problem... (fingers crossed)
Thanks for any suggestions
From: Hefty, Sean <sean.hefty at intel.com>
Sent: 10 February 2021 21:48:44
To: Biddiscombe, John A.; libfabric-users at lists.openfabrics.org
Subject: RE: Not receiving messages from other ranks
> Which provider are you using? You may need to call cq read, even for the send side, to
> ensure progress is being driven. If you're using rxm, I believe there's an environment
> variable you can set to force auto-progress. ("fi_info -g rxm" might help discover the
> I poll on the send endpoint tx CQ and I poll on the recv endpoint rx CQ - do you mean
> that I should also create a dummy rx CQ on the send endpoint and poll that too, to be sure?
I mean read the tx and rx CQs, which it sounds like you are doing.
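
To illustrate the advice above about reading both the tx and rx CQs so that progress keeps being driven, here is a minimal sketch of such a polling loop. It is not code from this thread: CqRead, PollResult and poll_both are hypothetical names, and the callables stand in for whatever wraps fi_cq_read in the application. The sketch assumes the fi_cq_read convention of returning a positive count on success and a negative errno-style value otherwise, with -EAGAIN meaning "nothing available yet".

```cpp
#include <cerrno>
#include <cstddef>
#include <functional>
#include <sys/types.h>   // ssize_t

// Hypothetical stand-in for one read of a completion queue.
// Assumed convention (modelled on fi_cq_read): > 0 is the number of
// completions read, -EAGAIN means the queue is currently empty, and
// any other negative value is a hard error.
using CqRead = std::function<ssize_t()>;

struct PollResult {
    std::size_t tx_done = 0;   // send completions seen
    std::size_t rx_done = 0;   // receive completions seen
    ssize_t error = 0;         // first hard error, 0 if none
};

// Read both queues on every pass until each has produced the expected
// number of completions, a hard error occurs, or max_spins passes elapse.
PollResult poll_both(const CqRead& read_tx, const CqRead& read_rx,
                     std::size_t want_tx, std::size_t want_rx,
                     std::size_t max_spins)
{
    PollResult r;
    for (std::size_t spin = 0; spin < max_spins; ++spin) {
        if (r.tx_done >= want_tx && r.rx_done >= want_rx) break;

        // The tx queue is read even if its completions have already
        // arrived: the read itself may be what drives progress.
        ssize_t n = read_tx();
        if (n > 0) r.tx_done += static_cast<std::size_t>(n);
        else if (n < 0 && n != -EAGAIN) { r.error = n; break; }

        n = read_rx();
        if (n > 0) r.rx_done += static_cast<std::size_t>(n);
        else if (n < 0 && n != -EAGAIN) { r.error = n; break; }
    }
    return r;
}
```

In a real application the two callables would presumably wrap something like fi_cq_read(txcq, &entry, 1) and fi_cq_read(rxcq, &entry, 1), and a hard error would be followed by a look at the CQ error entry. The max_spins bound is only there so the sketch cannot hang.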