[libfabric-users] Not receiving messages from other ranks
Hefty, Sean
sean.hefty at intel.com
Fri Feb 12 12:04:20 PST 2021
I'm looking into this. The checks in the tcp provider are too strict. I have a patch that fixes that, but I doubt it will help here. I'm still analyzing the rxm code to understand if it's passing the right capabilities to tcp and handling its checks correctly.
> -----Original Message-----
> From: Biddiscombe, John A. <john.biddiscombe at cscs.ch>
> Sent: Friday, February 12, 2021 5:09 AM
> To: Hefty, Sean <sean.hefty at intel.com>; libfabric-users at lists.openfabrics.org
> Subject: Re: Not receiving messages from other ranks
>
> FYI: If I use FI_RECV | FI_TRANSMIT then the endpoint enables (I still don't receive
> all messages, but that seems to be a different problem)
>
>
> ________________________________
>
> From: Libfabric-users <libfabric-users-bounces at lists.openfabrics.org> on behalf of
> Biddiscombe, John A. <john.biddiscombe at cscs.ch>
> Sent: 12 February 2021 13:05:51
> To: Hefty, Sean; libfabric-users at lists.openfabrics.org
> Subject: Re: [libfabric-users] Not receiving messages from other ranks
>
>
> After more debugging (and switching to tcp;ofi_rxm since sockets doesn't seem to work),
> I am left with the following:
>
>
>
>
> // create a completion queue for tx
> fabric_info_->tx_attr->op_flags |= FI_COMPLETION;
> txcq_ = create_completion_queue(fabric_domain_, fabric_info_->tx_attr->size);
> #ifdef SEPARATE_RX_TX_ENDPOINTS
> // setup an endpoint for sending messages
> ep_tx_ = new_endpoint_active(fabric_domain_, fabric_info_, nullptr);
> bind_queue_to_endpoint(ep_tx_, txcq_, FI_TRANSMIT);
> bind_address_vector_to_endpoint(ep_tx_, av_);
> enable_endpoint(ep_tx_);
> #else
> bind_queue_to_endpoint(ep_rx_, txcq_, FI_TRANSMIT);
> #endif
>
> When the ifdef is not defined, I bind txcq_ to the rx endpoint and ask for TRANSMIT
> completions, and everything works fine.
>
> When the ifdef is defined, fi_enable gives
>
> ERROR fi_enable : Missing or unavailable completion queue
>
>
> and I just don't know what's wrong. I'm creating a cq, binding it along with the AV,
> but it gives me the error.
>
> Any ideas why this might happen? Hopefully by the time the USA wakes up, I'll have
> found the problem... (fingers crossed)
>
> Thanks for any suggestions
>
>
> JB
>
>
> ________________________________
>
> From: Hefty, Sean <sean.hefty at intel.com>
> Sent: 10 February 2021 21:48:44
> To: Biddiscombe, John A.; libfabric-users at lists.openfabrics.org
> Subject: RE: Not receiving messages from other ranks
>
> > Which provider are you using? You may need to call cq read, even for the send side,
> > to ensure progress is being driven. If you're using rxm, I believe there's an
> > environment variable you can set to force auto-progress. ("fi_info -g rxm" might
> > help discover the name.)
> >
> > I poll on the send endpoint tx CQ and I poll on the recv endpoint rx CQ - do you mean
> > that I should also create a dummy rx CQ on the send endpoint and poll that too, to be
> > sure?
>
> I mean read the tx and rx CQs, which it sounds like you are doing.
>
> - Sean
>
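
The polling advice quoted above (read both the tx and rx CQs so that progress keeps being driven) can be sketched roughly as follows. This is only an illustration, not code from the thread: CqRead, PollResult, drain and poll_once are hypothetical names, and each CqRead callable is assumed to wrap a non-blocking CQ read such as fi_cq_read(), returning the number of completions read, or -EAGAIN (libfabric's -FI_EAGAIN) when the queue is currently empty.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <functional>

// Hypothetical stand-in for a non-blocking CQ read. Assumed contract:
//   > 0      number of completions read
//   -EAGAIN  queue is empty right now
//   other    a negative error code
using CqRead = std::function<long()>;

struct PollResult {
    std::size_t tx_completions = 0;
    std::size_t rx_completions = 0;
    long error = 0;  // first error other than -EAGAIN, or 0 if none
};

// Read one queue until it reports empty, adding to 'count'.
// Returns 0 on success or the negative error code that stopped the loop.
inline long drain(const CqRead& read, std::size_t& count) {
    for (;;) {
        const long n = read();
        if (n > 0) {
            count += static_cast<std::size_t>(n);
            continue;
        }
        if (n == 0 || n == -EAGAIN) {
            return 0;  // nothing more pending
        }
        return n;  // genuine error, report it to the caller
    }
}

// One progress pass. Both queues are always visited, even if the first
// one reports an error, so neither side is starved of progress.
inline PollResult poll_once(const CqRead& read_tx, const CqRead& read_rx) {
    assert(read_tx && read_rx);
    PollResult result;
    const long tx_err = drain(read_tx, result.tx_completions);
    const long rx_err = drain(read_rx, result.rx_completions);
    result.error = (tx_err != 0) ? tx_err : rx_err;
    return result;
}
```

In an application, poll_once would be called repeatedly from the main loop, including while waiting only for receives, since with manual progress the send side may not advance unless its CQ is read too.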