[libfabric-users] questions on handling CQ overrun
krehm at cray.com
Sat Aug 31 05:19:12 PDT 2019
I have questions about how to handle CQ overrun.
Suppose one has a libfabric domain in which there are a very large number of both servers and clients. The servers present a distributed hash table interface to the clients. Both clients and servers operate largely independently of their peers. An individual client hashes an object, takes the hash modulo the number of servers, and uses the result to decide which server receives the message for that object.
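To make the routing step concrete, here is a minimal sketch of how a client might pick a server. The hash function (64-bit FNV-1a) and the names fnv1a_64 and pick_server are assumptions for illustration only; nothing above depends on a particular hash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit FNV-1a. An assumed stand-in for whatever hash the clients use. */
uint64_t fnv1a_64(const void *key, size_t len)
{
    const unsigned char *p = key;
    uint64_t h = 14695981039346656037ULL;   /* FNV offset basis */

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;              /* FNV prime */
    }
    return h;
}

/* Map an object key to a server index in [0, num_servers). */
size_t pick_server(const void *key, size_t len, size_t num_servers)
{
    assert(num_servers > 0);
    return (size_t)(fnv1a_64(key, len) % num_servers);
}
```

Every client computes the same index for the same key without consulting any peer, which is exactly why no client knows how many other clients are targeting that server at the same moment.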
The only method I see for preventing CQ overrun at a target is for the initiator to make sure that it never has more messages in flight to the target than the depth of the target’s CQ. That works fine if the target has a separate endpoint for each initiator, but that seems like a scaling problem in very large configurations like the one above. Independent clients don’t know when other clients might be communicating with a particular server at the same time. While each client might limit the number of outstanding messages to a server, the combination of many simultaneously communicating clients can still easily overrun the server.
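The arithmetic behind this concern can be sketched as follows. The struct and function names are hypothetical, and the sketch assumes the initiator has some way to learn that the target has consumed a message (an acknowledgement of some kind); it models only the bookkeeping, not any libfabric call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-target send window kept by one initiator. */
struct send_window {
    size_t in_flight;   /* messages sent but not yet acknowledged */
    size_t limit;       /* cap chosen by this initiator */
};

/* Reserve a slot before sending; returns false if the window is full. */
bool window_try_acquire(struct send_window *w)
{
    if (w->in_flight >= w->limit)
        return false;
    w->in_flight++;
    return true;
}

/* Release a slot once the target has acknowledged a message. */
void window_release(struct send_window *w)
{
    assert(w->in_flight > 0);
    w->in_flight--;
}

/* Worst case at one target if every client fills its window at once. */
size_t worst_case_load(size_t num_clients, size_t per_client_limit)
{
    return num_clients * per_client_limit;
}

/* Largest static per-client limit that cannot overrun a CQ of the given
 * depth. A result of 0 means no safe static limit exists. */
size_t safe_per_client_limit(size_t cq_depth, size_t num_clients)
{
    assert(num_clients > 0);
    return cq_depth / num_clients;
}
```

For example, 1000 clients each allowed 16 outstanding messages can place 16000 completions on one target, and dividing a 4096-entry CQ statically among 10000 clients leaves each client a limit of 0. That is the scaling problem in a nutshell.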
How do providers handle CQ overrun? In the case of GNI, I see that it simply discards arriving CQEs if its CQ happens to be full. Do other providers handle this differently?
Are there other methods available for preventing CQ overrun at a target besides requiring each target to have a separate endpoint for each initiator? Or are there efficient, low-latency methods for recovering from CQ overrun in a configuration like the above?