[libfabric-users] Not receiving messages from other ranks

Hefty, Sean sean.hefty at intel.com
Fri Feb 12 12:04:20 PST 2021


I'm looking into this.  The checks in the tcp provider are too strict.  I have a patch that fixes that, but I doubt it will help here.  I'm still analyzing the rxm code to understand if it's passing the right capabilities to tcp and handling its checks correctly.
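
For readers following the thread, the behaviour reported so far can be pictured with a toy model. This is only a sketch under assumptions: every `Toy*`/`toy_*` name is hypothetical, nothing here calls libfabric, and it does not reproduce the tcp provider's actual checks. It simply contrasts a strict enable check that wants a CQ bound for both directions with one possible relaxed check that accepts a transmit-only endpoint.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the FI_TRANSMIT / FI_RECV bind flags.
// The values are arbitrary and are NOT libfabric's.
constexpr std::uint64_t TOY_TRANSMIT = 1u << 0;
constexpr std::uint64_t TOY_RECV     = 1u << 1;

// Toy endpoint that only remembers which directions have a CQ bound.
struct ToyEndpoint {
    std::uint64_t cq_flags = 0;
};

// Record a CQ binding for the given direction(s).
inline void toy_bind_cq(ToyEndpoint& ep, std::uint64_t flags) {
    ep.cq_flags |= flags;
}

// Strict check (assumed shape): refuse to enable unless BOTH
// directions have a CQ bound.
inline bool toy_enable_strict(const ToyEndpoint& ep) {
    const std::uint64_t both = TOY_TRANSMIT | TOY_RECV;
    return (ep.cq_flags & both) == both;
}

// Relaxed check (assumed shape): a CQ for at least one direction
// is enough to enable the endpoint.
inline bool toy_enable_relaxed(const ToyEndpoint& ep) {
    return (ep.cq_flags & (TOY_TRANSMIT | TOY_RECV)) != 0;
}
```

In this model an endpoint bound with only the transmit flag fails the strict check, while binding with both flags passes it, which is consistent with the FI_RECV | FI_TRANSMIT observation quoted below.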


> -----Original Message-----
> From: Biddiscombe, John A. <john.biddiscombe at cscs.ch>
> Sent: Friday, February 12, 2021 5:09 AM
> To: Hefty, Sean <sean.hefty at intel.com>; libfabric-users at lists.openfabrics.org
> Subject: Re: Not receiving messages from other ranks
> 
> FYI: if I use FI_RECV | FI_TRANSMIT then the endpoint enables (I still don't receive
> all messages, but that seems to be a different problem)
> 
> 
> ________________________________
> 
> From: Libfabric-users <libfabric-users-bounces at lists.openfabrics.org> on behalf of
> Biddiscombe, John A. <john.biddiscombe at cscs.ch>
> Sent: 12 February 2021 13:05:51
> To: Hefty, Sean; libfabric-users at lists.openfabrics.org
> Subject: Re: [libfabric-users] Not receiving messages from other ranks
> 
> 
> After more debugging (and switching to tcp;ofi_rxm since sockets doesn't seem to work),
> I am left with the following
> 
> 
> 
> 
>             // create a completion queue for tx
>             fabric_info_->tx_attr->op_flags |= FI_COMPLETION;
>             txcq_ = create_completion_queue(fabric_domain_, fabric_info_->tx_attr->size);
> #ifdef SEPARATE_RX_TX_ENDPOINTS
>             // setup an endpoint for sending messages
>             ep_tx_ = new_endpoint_active(fabric_domain_, fabric_info_, nullptr);
>             bind_queue_to_endpoint(ep_tx_, txcq_, FI_TRANSMIT);
>             bind_address_vector_to_endpoint(ep_tx_, av_);
>             enable_endpoint(ep_tx_);
> #else
>             bind_queue_to_endpoint(ep_rx_, txcq_, FI_TRANSMIT);
> #endif
> 
> When the ifdef is not defined, I bind txcq_ to the rx endpoint and ask for TRANSMIT
> completions, and everything works fine.
> 
> When the ifdef is defined, fi_enable gives
> 
>     ERROR fi_enable : Missing or unavailable completion queue
> 
> 
> and I just don't know what's wrong. I'm creating a cq, binding it along with the AV,
> but it gives me the error.
> 
> Any ideas why this might happen? Hopefully by the time the USA wakes up, I'll have
> found the problem... (fingers crossed)
> 
> Thanks for any suggestions
> 
> 
> JB
> 
> 
> ________________________________
> 
> From: Hefty, Sean <sean.hefty at intel.com>
> Sent: 10 February 2021 21:48:44
> To: Biddiscombe, John A.; libfabric-users at lists.openfabrics.org
> Subject: RE: Not receiving messages from other ranks
> 
> > Which provider are you using?  You may need to call cq read, even for the send side, to
> > ensure progress is being driven.  If you're using rxm, I believe there's an environment
> > variable you can set to force auto-progress.  ("fi_info -g rxm" might help discover the
> > name.)
> >
> > I poll on the send endpoint tx CQ and I poll on the recv endpoint rx CQ - do you mean
> > that I should also create a dummy rx CQ on the send endpoint and poll that too, to be sure?
> 
> I mean read the tx and rx CQs, which it sounds like you are doing.
> 
> - Sean
> 
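
The advice in the quoted exchange, to keep reading both the tx and rx CQs so that progress is driven, can be sketched as a small polling loop. This is a minimal illustration under assumptions: `CqReader` and `drive_progress` are hypothetical names, and each reader stands in for a user-written wrapper around a non-blocking CQ read that returns the number of completions retrieved, or a negative value when none are ready (the way `fi_cq_read` reports `-FI_EAGAIN`). It does not call libfabric itself.

```cpp
#include <cstddef>
#include <functional>

// Hypothetical reader type: returns the number of completions read,
// or a negative value when the queue currently has nothing to report.
using CqReader = std::function<long()>;

// Poll the tx and rx queues in turn until `expected` completions have
// been seen in total or `max_spins` iterations have elapsed.
// Returns the number of completions actually seen.
std::size_t drive_progress(const CqReader& read_tx, const CqReader& read_rx,
                           std::size_t expected, std::size_t max_spins) {
    std::size_t seen = 0;
    for (std::size_t spin = 0; spin < max_spins && seen < expected; ++spin) {
        // Read the send-side queue even when only receives are awaited.
        long n = read_tx();
        if (n > 0) seen += static_cast<std::size_t>(n);

        // Then read the receive-side queue.
        n = read_rx();
        if (n > 0) seen += static_cast<std::size_t>(n);
    }
    return seen;
}
```

The loop reads both queues on every iteration rather than waiting on one of them, so neither side is starved while the other is being checked.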


