[libfabric-users] Scalable endpoints
sean.hefty at intel.com
Tue Dec 1 16:45:37 PST 2020
> I'm getting not so great performance with small messages when using one endpoint with a
> single send queue and a single receive queue and N threads all sending and polling
> randomly. I looked through slides and saw mention of scalable endpoints being designed
> for lockless multi-threaded use.
I don't believe any of the in-tree providers have hardware that supports scalable endpoints, so you will be better off using multiple endpoints instead.
> I can't seem to find extra info anywhere and would like to read up a bit and see how to
> improve things. Are there any good resources (like a tutorial recorded from a
> workshop?). Also I've forgotten much of what I used to know and see some code I've left
> in my stuff that creates a shared receive context, but I never use it for anything. I
> can't remember what the context is for and would like to relearn this stuff. (I don't
> even remember what I need the contexts for now).
It helps if you think of a 'context' as a hardware command queue. A standard endpoint has 1 address, 1 transmit queue, and 1 receive queue. A scalable EP has 1 address, but can have multiple dedicated transmit/receive queues. A shared context (i.e. queue) allows a single hardware queue to be shared across multiple endpoints.
Verbs-based devices define shared receive queues, but not shared transmit queues. A shared receive context makes the most sense when using connected endpoints.
Note that OFI uses the term 'context' instead of 'queue' because transmit and receive contexts are not restricted to acting in a FIFO manner. Whether they do depends on the message and data ordering properties of the endpoint.
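
To make the relationships above concrete, here is a toy model in plain C. It is only a sketch of who owns which queue. None of these types or functions are libfabric APIs; the names (ctx, endpoint, standard_ep, scalable_ep, MAX_CTX) are made up for illustration.

```c
#include <stddef.h>

#define MAX_CTX 8   /* arbitrary limit for this toy model */

/* Stand-in for a hardware command queue (a 'context'). */
struct ctx { int id; };

/* Stand-in for an endpoint: one address plus the contexts it uses. */
struct endpoint {
    int addr;
    size_t n_tx, n_rx;
    struct ctx *tx[MAX_CTX];
    struct ctx *rx[MAX_CTX];
};

/* Standard endpoint: 1 address, 1 transmit queue, 1 receive queue.
 * Passing the same rx pointer to several endpoints models a shared
 * receive context. */
void standard_ep(struct endpoint *ep, int addr, struct ctx *tx, struct ctx *rx)
{
    ep->addr = addr;
    ep->n_tx = ep->n_rx = 1;
    ep->tx[0] = tx;
    ep->rx[0] = rx;
}

/* Scalable endpoint: still 1 address, but n dedicated tx/rx queues.
 * Returns 0 on success, -1 if n is out of range. */
int scalable_ep(struct endpoint *ep, int addr,
                struct ctx *tx, struct ctx *rx, size_t n)
{
    if (n == 0 || n > MAX_CTX)
        return -1;
    ep->addr = addr;
    ep->n_tx = ep->n_rx = n;
    for (size_t i = 0; i < n; i++) {
        ep->tx[i] = &tx[i];
        ep->rx[i] = &rx[i];
    }
    return 0;
}
```

In this model, two standard endpoints built with the same rx pointer share one receive queue while keeping separate addresses and transmit queues. A scalable endpoint is the reverse arrangement: one address fanned out over several queues, for example one per thread.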
> Should I have a single endpoint per thread for sending and a shared one for receiving
> from all threads? should I have scalable endpoints etc etc. Where can I find a good
> place to find out this info?
The man pages are the only place where this is really documented. If you are using connectionless endpoints, you should see better multi-threaded scaling by having dedicated resources per thread. The domain attribute 'threading' can be set to guide the locking behavior implemented by the providers. For example, FI_THREAD_DOMAIN or FI_THREAD_COMPLETION can result in some locks becoming no-op calls for some providers.
The drawback is that you'll allocate more resources per application than if you share resources among the threads.