[libfabric-users] Not receiving messages from other ranks

Biddiscombe, John A. john.biddiscombe at cscs.ch
Fri Feb 12 05:09:09 PST 2021


FYI: if I use FI_RECV | FI_TRANSMIT, the endpoint enables (I still don't receive all messages, but that seems to be a separate problem).

________________________________
From: Libfabric-users <libfabric-users-bounces at lists.openfabrics.org> on behalf of Biddiscombe, John A. <john.biddiscombe at cscs.ch>
Sent: 12 February 2021 13:05:51
To: Hefty, Sean; libfabric-users at lists.openfabrics.org
Subject: Re: [libfabric-users] Not receiving messages from other ranks


After more debugging (and switching to tcp;ofi_rxm, since the sockets provider doesn't seem to work), I am left with the following:


            // create a completion queue for tx
            fabric_info_->tx_attr->op_flags |= FI_COMPLETION;
            txcq_ = create_completion_queue(fabric_domain_, fabric_info_->tx_attr->size);
#ifdef SEPARATE_RX_TX_ENDPOINTS
            // setup an endpoint for sending messages
            ep_tx_ = new_endpoint_active(fabric_domain_, fabric_info_, nullptr);
            bind_queue_to_endpoint(ep_tx_, txcq_, FI_TRANSMIT);
            bind_address_vector_to_endpoint(ep_tx_, av_);
            enable_endpoint(ep_tx_);
#else
            bind_queue_to_endpoint(ep_rx_, txcq_, FI_TRANSMIT);
#endif

When the ifdef is not defined, I bind txcq_ to the rx endpoint and ask for TRANSMIT completions, and everything works fine.

When the ifdef is defined, fi_enable gives

    ERROR fi_enable : Missing or unavailable completion queue

and I just don't know what's wrong. I'm creating a cq, binding it along with the AV, but it gives me the error.

Any ideas why this might happen? Hopefully by the time the USA wakes up, I'll have found the problem... (fingers crossed)

Thanks for any suggestions

JB

________________________________
From: Hefty, Sean <sean.hefty at intel.com>
Sent: 10 February 2021 21:48:44
To: Biddiscombe, John A.; libfabric-users at lists.openfabrics.org
Subject: RE: Not receiving messages from other ranks

> Which provider are you using?  You may need to call cq read, even for the send side, to
> ensure progress is being driven.  If you're using rxm, I believe there's an environment
> variable you can set to force auto-progress.  ("fi_info -g rxm" might help discover the
> name.)
>
> I poll on the send endpoint tx CQ and I poll on the recv endpoint rx CQ - do you mean
> that I should also create a dummy rx CQ on the send endpoint and poll that too be sure?

I mean read the tx and rx CQs, which it sounds like you are doing.

- Sean
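
The advice above (read both the tx and rx CQs so that progress is driven) can be sketched roughly as follows. This is only an illustration, not code from the thread: the names CqPoll, CqReader, ProgressCounts, drain and progress_once are hypothetical, and the libfabric call is replaced by a callable so that the sketch compiles without libfabric. In a real application each reader would presumably wrap a non-blocking fi_cq_read() on the corresponding queue.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Outcome of one non-blocking read attempt on a completion queue.
// Assumption: with libfabric, a positive return from fi_cq_read() would map
// to Completed, -FI_EAGAIN to Empty, and any other negative value to Error.
enum class CqPoll { Completed, Empty, Error };

// Hypothetical wrapper: any callable that performs one non-blocking CQ read.
using CqReader = std::function<CqPoll()>;

struct ProgressCounts {
    std::size_t tx = 0;   // send completions reaped in this pass
    std::size_t rx = 0;   // receive completions reaped in this pass
    bool error = false;   // set if either queue reported an error
};

// Read from one queue until it is empty, reports an error,
// or `budget` completions have been reaped.
inline std::size_t drain(const CqReader& read_one, std::size_t budget, bool& error) {
    assert(static_cast<bool>(read_one));
    std::size_t reaped = 0;
    while (reaped < budget) {
        const CqPoll r = read_one();
        if (r == CqPoll::Completed) {
            ++reaped;
            continue;
        }
        if (r == CqPoll::Error) {
            error = true;
        }
        break;  // Empty or Error: stop reading this queue for now
    }
    return reaped;
}

// One progress pass. Both queues are always read, even when only sends
// (or only receives) are outstanding, so that neither side is starved.
inline ProgressCounts progress_once(const CqReader& tx, const CqReader& rx,
                                    std::size_t budget = 16) {
    ProgressCounts counts;
    counts.tx = drain(tx, budget, counts.error);
    counts.rx = drain(rx, budget, counts.error);
    return counts;
}
```

The per-queue budget is a design choice of this sketch: it keeps a busy tx queue from delaying reads of the rx queue indefinitely. The application would call progress_once() repeatedly from its polling loop.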
