[libfabric-users] omnipath/psm2 performance degradation in libfabric 1.13.0
carns at mcs.anl.gov
Mon Jul 12 11:17:59 PDT 2021
I recently tried upgrading from libfabric 1.11.1 to 1.13.0 for some
builds on a Linux cluster equipped with Haswell processors and an
OmniPath interconnect. It's using opa-psm2 11.2.185 and the psm2
provider in libfabric.
Performance dropped sharply after this upgrade, though. We run
nightly performance regression tests (just simple point-to-point
in-house benchmarks). The bandwidth went from a peak of around 11,800
MiB/s down to 1,300 MiB/s, and the latency went from around 4 usec to 18
usec. These results are very consistent.
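For scale, here is a minimal Python sketch of the regression factors implied by these approximate figures. The helper name `regression_factor` is hypothetical and is not part of the in-house benchmarks; the inputs are just the rounded numbers quoted above.

```python
def regression_factor(before, after, higher_is_better):
    """Return how many times worse `after` is compared to `before`."""
    if higher_is_better:
        return before / after
    return after / before


# Approximate figures from the nightly runs (1.11.1 vs. 1.13.0).
bandwidth_factor = regression_factor(11800.0, 1300.0, higher_is_better=True)
latency_factor = regression_factor(4.0, 18.0, higher_is_better=False)

print(f"bandwidth: {bandwidth_factor:.1f}x lower")
print(f"latency:   {latency_factor:.1f}x higher")
```

That works out to roughly a 9x drop in peak bandwidth and a 4.5x increase in latency.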
The libfabric 1.11.1 -> 1.13.0 update didn't alter performance in any
meaningful way on the other platforms we run nightly tests on (including
verbs, tcp, sockets, and gni providers). This seems to be something
peculiar to psm2.
Does anyone have any theories about what might be going wrong here? Maybe
something that needs to be configured differently?
I can work on narrowing it down of course, but I thought I would ask
here for obvious problems before burning time on it.