[libfabric-users] [External] Detecting errors/flow control with FI_SELECTIVE_COMPLETION

D'Alessandro, Luke K ldalessa at iu.edu
Mon Aug 24 14:02:47 PDT 2020



> On Aug 24, 2020, at 1:45 PM, Hefty, Sean <sean.hefty at intel.com> wrote:
> 
> 
>> I’m trying to use libfabric to get some baseline performance numbers for some research
>> that we’re doing.
>> 
>> The functionality that I need is simply to transfer (address, value) pairs via remote
>> completion in an all-to-all setting.
>> 
>> I have sequential ranks, and have co-opted RDM/sockets/fi_inject_writedata with 0-length
>> messages to do this; it implements the required semantics correctly.
>> 
>> The problem is that I can’t figure out how to implement flow control in this setting.
>> Obviously I can run into resource limits in both the local and remote endpoints, and
>> I’m able to slow down and/or buffer on the TX side if need be.
> 
> If you're using sock_stream underneath, then the tcp layer is already handling flow control between peers.  The buffering will end up being done in the local kernel.
> 
>> I am using a TX CQ bound with FI_SELECTIVE_COMPLETION, an RX CQ, and a TX counter. I
>> definitely see <warn> messages if I spam TX or don’t complete RX fast enough, but I
>> can’t seem to detect any errors/failures on the TX side that would let me slow down
>> (neither the CQ nor the counter ever reports an error).
> 
> Slowdowns on the Tx side should result in the transmit operation returning -FI_EAGAIN.  I need to check the code to be sure, but I believe sockets will dynamically grow the CQ if needed.

Ohhh, so the fact that I’m not ever seeing any -FI_EAGAIN from fi_inject_writedata (or the fi_writemsg equivalent) might just be an artifact of the sockets provider that I’m testing with. If I move over to a different provider I might be able to see these occur? I’ll give that a shot.
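
For reference, this is roughly the post path I’ll test with. It’s just my own sketch (the function and variable names are mine, not anything from this thread): a 0-length fi_inject_writedata that retries on -FI_EAGAIN, reading the TX CQ in between so any queued error surfaces through fi_cq_readerr.

#include <stdio.h>
#include <rdma/fabric.h>
#include <rdma/fi_errno.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_rma.h>

/* Sketch only: ep/txcq/dest/raddr/rkey are whatever my setup code produced. */
static ssize_t post_pair(struct fid_ep *ep, struct fid_cq *txcq,
                         uint64_t data, fi_addr_t dest,
                         uint64_t raddr, uint64_t rkey)
{
    ssize_t rc;
    do {
        /* 0-length remote write: only 'data' is delivered to the peer's RX CQ */
        rc = fi_inject_writedata(ep, NULL, 0, data, dest, raddr, rkey);
        if (rc == -FI_EAGAIN) {
            /* drive progress; also surface any TX error the provider queued */
            struct fi_cq_data_entry e;
            ssize_t n = fi_cq_read(txcq, &e, 1);
            if (n == -FI_EAVAIL) {
                struct fi_cq_err_entry err = { 0 };
                fi_cq_readerr(txcq, &err, 0);
                fprintf(stderr, "tx error: %s\n",
                        fi_cq_strerror(txcq, err.prov_errno,
                                       err.err_data, NULL, 0));
            }
        }
    } while (rc == -FI_EAGAIN);
    return rc;
}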

>>> libfabric:290798:sockets:cq:_sock_cq_write():181<warn> Not enough space in CQ
>>> rank 0 posting 4279 to 0
>>> rank [0] fi error: Resource temporarily unavailable
>>> libfabric:290798:sockets:ep_data:sock_rx_new_buffered_entry():109<warn> Exceeded buffered recv limit
>>> [portland:290798] *** Process received signal ***
>>> libfabric:290798:sockets:ep_data:sock_pe_new_tx_entry():2251<warn> Invalid operation type
>>> libfabric:290798:sockets:ep_data:sock_pe_progress_tx_ctx():2546<warn> failed to progress TX ctx
>>> [portland:290798] Signal: Aborted (6)
>>> [portland:290798] Signal code:  (-6)
>>> libfabric:290798:sockets:ep_data:sock_pe_progress_thread():2650<warn> failed to progress TX
>> 
>> 
>> Any ideas on what I can do here to discover these <warn>s eagerly at the user level?
>> 
>> I can easily set up a hard limit on the number of outstanding TX operations (computed
>> via the TX counter), but I don’t know where to find the right number for that limit.
>> Also, given the all-to-all nature of the communication, I can do dedicated per-peer
>> remote RX accounting if that’s something I need to do manually.
> 
> The Tx/Rx sizes are set through the fi_info attributes.  You can specify/retrieve them through fi_getinfo().

Okay, I’ve been using them to initialize the TX/RX CQ sizes, but they didn’t seem to correspond to any change in behavior that I could observe from the sockets provider.
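
For completeness, here’s how I’m pulling those sizes out right now, again just a sketch of my own test setup (the caps and the API version are whatever I happen to be using):

#include <stddef.h>
#include <rdma/fabric.h>
#include <rdma/fi_errno.h>
#include <rdma/fi_endpoint.h>

/* Ask for an RDM endpoint and read back the provider's reported queue depths. */
static int query_depths(struct fi_info **info_out,
                        size_t *tx_depth, size_t *rx_depth)
{
    struct fi_info *hints = fi_allocinfo();
    if (!hints)
        return -FI_ENOMEM;

    hints->ep_attr->type = FI_EP_RDM;
    hints->caps = FI_RMA | FI_WRITE | FI_REMOTE_WRITE;

    int rc = fi_getinfo(FI_VERSION(1, 10), NULL, NULL, 0, hints, info_out);
    fi_freeinfo(hints);
    if (rc)
        return rc;

    *tx_depth = (*info_out)->tx_attr->size;   /* max outstanding TX operations */
    *rx_depth = (*info_out)->rx_attr->size;   /* max posted RX buffers */
    return 0;
}

I then open the CQs with at least that many entries via fi_cq_attr.size, so maybe my problem is elsewhere.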

> The providers typically buffer unexpected messages (or at least the message headers) at the receiver, while there is sufficient memory.  Flow control across multiple peers is difficult to achieve without introducing the possibility of application deadlock.
> 
> - Sean
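
On the hard-limit idea from my original mail, this is roughly what I had in mind, assuming tx_attr->size is a sensible cap on outstanding operations and that the TX counter was opened with a wait object (all names here are mine):

#include <stddef.h>
#include <stdint.h>
#include <rdma/fi_domain.h>

/* 'issued' counts injects I've posted; the bound TX counter counts completions.
 * Block whenever the gap reaches tx_depth. */
static int throttle_tx(struct fid_cntr *txcntr, uint64_t issued, size_t tx_depth)
{
    uint64_t done = fi_cntr_read(txcntr);

    if (issued - done < tx_depth)
        return 0;                       /* room to post another op */

    /* wait until at least one more completion lands */
    return fi_cntr_wait(txcntr, issued - tx_depth + 1, -1);
}

If that depth doesn’t map cleanly onto what a given provider can actually absorb, I’ll just treat the limit as a tunable and sweep it.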


