[libfabric-users] Suggestions needed for improved performance
Biddiscombe, John A.
john.biddiscombe at cscs.ch
Thu Jun 9 05:48:43 PDT 2022
I'm looking for suggestions on things to try. One of our benchmarks that uses libfabric performs well enough with small messages. The benchmark is written so that we can swap the back-end between a native MPI implementation and a libfabric implementation and compare performance. The test uses tagged sends and receives between two nodes and simply does lots of them, with a certain number of messages allowed to be 'in flight' per thread at any moment.
On Piz Daint, the Cray machine at CSCS:
8 threads, message size 1 byte, 10 per thread in flight at any time
libfabric : 0.80 MB/s
mpi : 0.38 MB/s
8 threads, message size 100 byte, 10 per thread in flight at any time
libfabric : 85 MB/s
mpi : 37 MB/s
8 threads, message size 10000 byte, 10 per thread in flight at any time
libfabric : 3600 MB/s
mpi : 2000 MB/s
8 threads, message size 100000 byte, 10 per thread in flight at any time
libfabric : 10800 MB/s
mpi : 13900 MB/s
At the largest message size we now lag well behind MPI, which reaches roughly the expected system bandwidth (similar to the OSU benchmarks).
The benchmark uses message buffer objects with a custom allocator; all memory from this allocator is pinned up front with fi_mr_reg (we use FI_MR_BASIC mode), so no memory is pinned during the benchmark run - everything is registered when the buffers are created at startup. Messages are sent with tagged sends, and each buffer's memory descriptor is supplied to the call:
m_tx_endpoint.get_ep(), send_region.get_address(), send_region.get_size(),
send_region.get_local_key(), dst_addr_, tag_, ctxt);
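For reference, the truncated call above matches the shape of libfabric's fi_tsend(3). A non-runnable sketch with the missing pieces filled in as assumptions (the function name and the meaning of get_local_key() are my guesses; the wrapper names are the benchmark's own):

```c
/* Sketch only -- assumes the snippet above is a call to fi_tsend(3)
 * and that get_local_key() returns the local descriptor obtained
 * from fi_mr_desc() on the pre-registered region. */
ssize_t rc = fi_tsend(m_tx_endpoint.get_ep(),      /* struct fid_ep *ep   */
                      send_region.get_address(),   /* const void *buf     */
                      send_region.get_size(),      /* size_t len          */
                      send_region.get_local_key(), /* void *desc          */
                      dst_addr_,                   /* fi_addr_t dest_addr */
                      tag_,                        /* uint64_t tag        */
                      ctxt);                       /* void *context       */
if (rc == -FI_EAGAIN) {
    /* transmit queue full: drain the completion queue and retry */
}
```

Note that fi_tsend's fourth argument is the local descriptor (fi_mr_desc), not the remote key (fi_mr_key); the sketch assumes the wrapper supplies the former.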
So the question is: what could be going wrong in the libfabric backend to cause such a significant drop in relative performance with larger messages? I've experimented with the FI_THREAD_SAFE threading options and with adding/removing locks around the injection and polling code, but since we perform well with small messages, I do not think there is anything wrong with the basic framework around the send/recv and polling functions. It appears to be a message-size issue. Is libfabric assuming that the buffers are not pinned and wasting time trying to pin them again?
One caveat: the benchmark uses MPI for initialization, so the libfabric tests coexist with MPI in the same executable (and use the GNI backend). I ran the same tests on LUMI (verbs backend) and saw similar slowdowns (though on LUMI the MPI uses the libfabric backend too), but I cannot access that machine again until maintenance is over.
On Daint I launch with MPICH_GNI_NDREG_ENTRIES=1024, set the memory registration cache to udreg, and set lazy deregistration to true (not that GNI should be registering much, since we have already registered everything ourselves).
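Spelled out, that launch environment would look roughly like this; MPICH_GNI_NDREG_ENTRIES is stated in the post, but the two FI_GNI_* names are my assumed spellings of "mem reg to udreg" and "lazy dreg to true", so check the fi_gni(7) man page on your system:

```shell
# Cray/GNI launch environment as described above.
export MPICH_GNI_NDREG_ENTRIES=1024   # poster's own setting
# Assumed variable names for the GNI provider's registration cache:
export FI_GNI_MR_CACHE=udreg
export FI_GNI_MR_CACHE_LAZY_DEREG=true
```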
I welcome any suggestions as to what MPI might be doing better, or what we might be doing wrong. (I tried profiling and saw no obvious hotspots in our code; the major time hog was polling the receive queues.)