[libfabric-users] omnipath/psm2 performance degradation in libfabric 1.13.0

Phil Carns carns at mcs.anl.gov
Mon Jul 12 11:17:59 PDT 2021


Hi all,

I recently tried upgrading from libfabric 1.11.1 to 1.13.0 for some 
builds on a Linux cluster equipped with Haswell processors and an 
OmniPath interconnect.  It's using opa-psm2 11.2.185 and the psm2 
provider in libfabric.

Performance dropped sharply after this upgrade, though.  We run 
nightly performance regression tests (simple point-to-point 
in-house benchmarks).  Peak bandwidth went from around 11,800 
MiB/s down to 1,300 MiB/s, and latency went from around 4 usec to 18 
usec.  These results are very consistent.
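
For a sense of scale, here is a quick back-of-the-envelope calculation of 
the regression factors from the approximate numbers above.  The variable 
names are just for illustration; this is not part of our benchmark code.

```python
# Approximate peak numbers quoted above, before (1.11.1) and after (1.13.0).
bw_before_mib_s = 11800
bw_after_mib_s = 1300
lat_before_usec = 4
lat_after_usec = 18

# How many times lower the bandwidth is, and how many times higher the latency.
bw_drop_factor = bw_before_mib_s / bw_after_mib_s
lat_increase_factor = lat_after_usec / lat_before_usec

print(f"bandwidth dropped by about {bw_drop_factor:.1f}x")
print(f"latency increased by about {lat_increase_factor:.1f}x")
```

So roughly a 9x drop in bandwidth and a 4.5x increase in latency.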

The libfabric 1.11.1 -> 1.13.0 update didn't alter performance in any 
meaningful way on the other platforms we run nightly tests on (including 
verbs, tcp, sockets, and gni providers). This seems to be something 
peculiar to psm2.

Does anyone have any theories about what might be going wrong here?  
Maybe something needs to be configured differently?

I can work on narrowing it down of course, but I thought I would ask 
here for obvious problems before burning time on it.

thanks!

-Phil



