[libfabric-users] omnipath/psm2 performance degradation in libfabric 1.13.0
Phil Carns
carns at mcs.anl.gov
Mon Jul 12 13:08:36 PDT 2021
Well, mystery partially solved.
If I configure libfabric 1.13.0 with an explicit --disable-psm3 (it was
being enabled by default) then performance is back to normal.
I'll have to dig into the code a little more, but I guess I was somehow
getting the psm3 provider activated at run time instead of psm2?
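
For anyone wanting to check the same thing without rebuilding, here is a
rough sketch of how the provider selection could be inspected and pinned at
run time. It assumes the fi_info utility that ships with libfabric is on
PATH, and it uses the FI_PROVIDER environment variable to restrict selection
to psm2. Whether this alone restores the original numbers on this cluster is
an assumption, not something verified here.

```shell
#!/bin/sh
# Sketch only: inspect and pin the libfabric provider at run time.

if command -v fi_info >/dev/null 2>&1; then
    # List the providers visible to libfabric before any filtering.
    fi_info -l
else
    echo "fi_info not found on PATH; skipping provider listing"
fi

# Restrict libfabric to the psm2 provider for this shell and its children.
# (FI_PROVIDER="^psm3" would exclude psm3 instead of pinning psm2.)
FI_PROVIDER=psm2
export FI_PROVIDER

if command -v fi_info >/dev/null 2>&1; then
    # Show what psm2 offers; a failure here suggests psm2 is not usable.
    fi_info -p psm2 || echo "psm2 provider not available in this build"
fi
```

If psm3 appears in the listing and the benchmark numbers recover once
FI_PROVIDER is set, that would be consistent with psm3 being picked ahead of
psm2 by default.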
-Phil
On 7/12/21 2:17 PM, Phil Carns via Libfabric-users wrote:
> Hi all,
>
> I recently tried upgrading from libfabric 1.11.1 to 1.13.0 for some
> builds on a Linux cluster equipped with Haswell processors and an
> OmniPath interconnect. It's using opa-psm2 11.2.185 and the psm2
> provider in libfabric.
>
> The performance dropped immensely after this upgrade, though. We run
> nightly performance regression tests (just simple point to point
> in-house benchmarks). The bandwidth went from a peak of around 11,800
> MiB/s down to 1,300 MiB/s, and the latency went from around 4 usec to
> 18 usec. These results are very consistent.
>
> The libfabric 1.11.1 -> 1.13.0 update didn't alter performance in any
> meaningful way on the other platforms we run nightly tests on
> (including verbs, tcp, sockets, and gni providers). This seems to be
> something peculiar to psm2.
>
> Anyone have any theories what might be going wrong here? Maybe
> something that needs to be configured differently?
>
> I can work on narrowing it down of course, but I thought I would ask
> here for obvious problems before burning time on it.
>
> thanks!
>
> -Phil
>
>
> _______________________________________________
> Libfabric-users mailing list
> Libfabric-users at lists.openfabrics.org
> https://lists.openfabrics.org/mailman/listinfo/libfabric-users