[libfabric-users] Detecting errors/flow control with FI_SELECTIVE_COMPLETION

Mon Aug 24 13:45:14 PDT 2020

> I’m trying to use libfabric to get some baseline performance numbers for some research
> that we’re doing.
> 
> The functionality that I need is simply to transfer (address, value) pairs via remote
> completion in an all-to-all setting.
> 
> I have sequential ranks, and have coopted RDM/sockets/fi_inject_writedata with 0-length
> messages to do this, and it implements the required semantics correctly.
> 
> The problem that I have is that I can’t figure out how to implement flow control in
> this setting. Obviously I should have resource issues in both the local and remote
> endpoints, and I’m able to slow down and/or buffer on the TX side if need be.

If you're using sock_stream underneath, then the TCP layer is already handling flow control between peers. The buffering will end up being done in the local kernel.

> I am using a TX:FI_SELECTIVE_COMPLETION cq, an RX cq, and a TX counter. I definitely
> see <warn> messages if I spam tx or don’t complete rx fast enough, but I can’t seem to
> detect any errors/failures on the TX side that would let me slow down (neither the cq
> nor the counter ever reports an error).

Slowdowns on the Tx side should result in the transmit operation itself returning -FI_EAGAIN. I need to check the code to be sure, but I believe the sockets provider will dynamically grow the CQ if needed.
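
A minimal sketch of reacting to that return code: retry the post while it reports "try again", and drive progress between attempts. Everything here is hypothetical scaffolding rather than libfabric API. The `post_fn` and `progress_fn` callbacks stand in for the application's transmit call (e.g. a wrapper around fi_inject_writedata) and its CQ-draining routine (e.g. a wrapper around fi_cq_read). The sketch uses the system EAGAIN from <errno.h> as a stand-in for FI_EAGAIN, and `fake_ep` only exists so the loop can be exercised without a fabric.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback types. In a libfabric program, post would wrap the
 * transmit call and progress would drain the completion queues. */
typedef int (*post_fn)(void *ctx);
typedef void (*progress_fn)(void *ctx);

/* Call post until it stops returning -EAGAIN or max_tries attempts have been
 * made. Returns the last result of post (0, another negative error, or
 * -EAGAIN if the limit was reached). *retries receives the number of
 * attempts that were rejected with -EAGAIN. */
static int post_with_retry(post_fn post, progress_fn progress, void *ctx,
                           size_t max_tries, size_t *retries)
{
    size_t rejected = 0;
    int ret = -EAGAIN;

    while (rejected < max_tries) {
        ret = post(ctx);
        if (ret != -EAGAIN)
            break;          /* success or a hard error: stop retrying */
        progress(ctx);      /* give the provider a chance to free resources */
        rejected++;
    }
    if (retries)
        *retries = rejected;
    return ret;
}

/* Stand-in endpoint for illustration only: rejects posts while busy > 0. */
struct fake_ep {
    int busy;   /* number of progress calls needed before a post succeeds */
    int sent;   /* number of posts accepted */
};

static int fake_post(void *ctx)
{
    struct fake_ep *ep = ctx;
    if (ep->busy > 0)
        return -EAGAIN;
    ep->sent++;
    return 0;
}

static void fake_progress(void *ctx)
{
    struct fake_ep *ep = ctx;
    if (ep->busy > 0)
        ep->busy--;
}
```

Bounding the retries with max_tries is a design choice of this sketch: it lets the caller fall back to buffering on the TX side instead of spinning forever.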

> > libfabric:290798:sockets:cq:_sock_cq_write():181<warn> Not enough space in CQ
> > rank 0 posting 4279 to 0
> > rank [0] fi error: Resource temporarily unavailable
> > libfabric:290798:sockets:ep_data:sock_rx_new_buffered_entry():109<warn> Exceeded buffered recv limit
> > [portland:290798] *** Process received signal ***
> > libfabric:290798:sockets:ep_data:sock_pe_new_tx_entry():2251<warn> Invalid operation type
> > libfabric:290798:sockets:ep_data:sock_pe_progress_tx_ctx():2546<warn> failed to progress TX ctx
> > [portland:290798] Signal: Aborted (6)
> > [portland:290798] Signal code:  (-6)
> > libfabric:290798:sockets:ep_data:sock_pe_progress_thread():2650<warn> failed to progress TX
> 
> 
> Any ideas on what I can do here to discover these <warn>s eagerly at the user level?
> 
> I can easily set up some hard limits on the number of outstanding TX operations
> (computed via the TX counter), but I don’t know where to find out what the right number
> for that would be. Also, given the all-to-all nature of the communication I can
> dedicate point-to-point remote RX accounting if that’s something I need to do manually.

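The Tx/Rx queue sizes are set through the fi_info attributes (tx_attr->size and rx_attr->size). You can request values through the hints passed to fi_getinfo() and retrieve the sizes the provider selected from the fi_info it returns.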
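
One way to turn such a size into the hard limit on outstanding TX operations mentioned in the question is a simple window check. The sketch below is an illustration under assumptions, not the provider's own accounting: `tx_window`, `tx_outstanding`, and `tx_try_reserve` are hypothetical names, `limit` is assumed to be filled from the Tx size reported by fi_getinfo(), and `completed` is assumed to come from reading the TX counter (e.g. with fi_cntr_read). Whether the reported Tx size is the right bound for a given provider is itself an assumption to verify.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping for a bounded number of in-flight transmits. */
struct tx_window {
    size_t   limit;   /* max outstanding ops, e.g. the Tx size from fi_info */
    uint64_t posted;  /* transmits issued so far */
};

/* Operations issued but not yet counted as complete. */
static size_t tx_outstanding(const struct tx_window *w, uint64_t completed)
{
    return (size_t)(w->posted - completed);
}

/* Reserve one slot if the window has room. Returns false when the caller
 * should slow down or buffer instead of posting. */
static bool tx_try_reserve(struct tx_window *w, uint64_t completed)
{
    if (tx_outstanding(w, completed) >= w->limit)
        return false;
    w->posted++;
    return true;
}
```

A caller would invoke tx_try_reserve() before each transmit and post only when it returns true. This bounds local TX resources only; it does not account for receive-side buffering at the peers.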

The providers typically buffer unexpected messages (or at least the message headers) at the receiver, as long as there is sufficient memory. Flow control across multiple peers is difficult to achieve without introducing the possibility of application deadlock.

- Sean