[ofiwg] [libfabric-users] TX/RX data structures and data processing mode
sean.hefty at intel.com
Fri Mar 16 09:50:51 PDT 2018
Copying ofiwg -- that mailing list is better suited for your questions.
> My group works on implementing a new libfabric provider for our HPC
> interconnect. Our current main goal is to run MPICH and OpenMPI over
> this provider.
> The problem is that this NIC doesn't have any software or hardware rx/tx
> queues for send/recv operations. We've decided to implement them at the
> libfabric provider level, so I'm looking for data structures for storing
> and processing these queues.
> I took a look at the sockets provider code. As far as I understand, tx_ctx
> stores pointers to all the information (flags, data, src_address, etc.)
> about every message to send in a ring buffer, while rx_ctx stores every
> rx_entry in a doubly-linked list. What was the motivation for choosing
> different data structures for the tx and rx queues?
Please look at the code in prov/util for help. The sockets provider was designed as a development tool, so I wouldn't recommend copying its implementation.
The udp provider is a good place to start for how to construct a very simple software provider. You may also want to scan the include/ofi_xxx.h files for helpful abstractions. There's a slightly out of date document in docs/providers that describes what's available. ofi_list.h and ofi_mem.h both have useful abstractions.
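For illustration, here is a minimal sketch of how a provider-private tx entry might be tracked with the dlist helpers from include/ofi_list.h (dlist_init and dlist_insert_tail, as found in current trees -- check the header for the exact set). The my_* structures and functions are hypothetical, invented for this example:

    #include <rdma/fabric.h>   /* fi_addr_t */
    #include <ofi_list.h>      /* struct dlist_entry and helpers, libfabric source tree */

    /* Hypothetical provider-private structures -- not part of libfabric. */
    struct my_tx_entry {
            struct dlist_entry      entry;          /* links this entry into the pending list */
            void                    *buf;
            size_t                  len;
            fi_addr_t               dest_addr;
            void                    *context;       /* handed back to the app in the completion */
            uint64_t                flags;
    };

    struct my_tx_ctx {
            struct dlist_entry      pending;        /* operations posted but not yet completed */
    };

    static void my_tx_ctx_init(struct my_tx_ctx *tx)
    {
            dlist_init(&tx->pending);
    }

    static void my_post_send(struct my_tx_ctx *tx, struct my_tx_entry *e)
    {
            dlist_insert_tail(&e->entry, &tx->pending);
    }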
> Maybe you can give some advice on implementing these queues, or point me
> to useful information on this topic?
If you are attempting to implement reliable-datagram semantics, then lists may be a better fit than a queue, since messages may complete out of order when targeting different peers.
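To make that concrete, continuing the hypothetical my_tx_ctx/my_tx_entry sketch above: when an acknowledgment arrives from one peer, the provider can complete that peer's oldest pending entry even if entries for other peers were posted earlier, which a strict FIFO ring cannot express:

    /* Continues the hypothetical structures above. The cast is valid because
     * 'entry' is the first member of struct my_tx_entry. */
    static struct my_tx_entry *my_find_pending(struct my_tx_ctx *tx, fi_addr_t peer)
    {
            struct dlist_entry *item;

            for (item = tx->pending.next; item != &tx->pending; item = item->next) {
                    struct my_tx_entry *e = (struct my_tx_entry *) item;

                    if (e->dest_addr == peer)
                            return e;       /* oldest pending entry for this peer */
            }
            return NULL;
    }

    /* Called when the NIC acknowledges delivery to 'peer'. */
    static void my_handle_ack(struct my_tx_ctx *tx, fi_addr_t peer)
    {
            struct my_tx_entry *e = my_find_pending(tx, peer);

            if (e) {
                    dlist_remove(&e->entry);        /* unlinks from the middle if needed */
                    /* ... write a completion for e->context to the bound CQ ... */
            }
    }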
Depending on your provider, you may also be able to take advantage of the utility providers. RxM implements reliable-datagram support over reliable connections and is functional today. RxD targets reliable-datagram over unreliable datagrams, but that is still a work in progress.
> The second problem is about choosing a suitable progress model. For CPU
> performance reasons I want to choose FI_PROGRESS_MANUAL as the primary
> mode for processing asynchronous requests, but I don't quite understand
> how an application thread provides data progress. For example, is it
> enough to call fi_cq_read() from the MPI implementation whenever it
> wants to make progress?
Yes, the app calling fi_cq_read() needs to be sufficient to drive progress. Note that in manual progress mode the app is expected to call it even if no completions are expected.
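For example, a sketch of an MPI-style progress hook under FI_PROGRESS_MANUAL (the my_poll_cq_once name is hypothetical; error-entry handling via fi_cq_readerr() is only noted, not shown):

    #include <rdma/fabric.h>
    #include <rdma/fi_eq.h>     /* struct fid_cq, struct fi_cq_tagged_entry, fi_cq_read */
    #include <rdma/fi_errno.h>  /* FI_EAGAIN, FI_EAVAIL */

    /* Poll the CQ once and return the number of completions drained.
     * Under FI_PROGRESS_MANUAL the provider is expected to advance all
     * outstanding work from inside calls like this one, so MPI should
     * invoke it from its progress engine even when it is not waiting
     * for a specific completion. */
    static int my_poll_cq_once(struct fid_cq *cq)
    {
            struct fi_cq_tagged_entry comp[8];
            ssize_t ret;

            ret = fi_cq_read(cq, comp, 8);
            if (ret > 0) {
                    /* ... match comp[i].op_context back to the MPI requests ... */
                    return (int) ret;
            }
            if (ret == -FI_EAGAIN)
                    return 0;       /* nothing completed, but progress was still made */

            /* ret == -FI_EAVAIL (or another error): retrieve the error
             * completion with fi_cq_readerr() -- omitted here. */
            return 0;
    }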