[libfabric-users] omnipath/psm2 performance degradation in libfabric 1.13.0

Phil Carns carns at mcs.anl.gov
Mon Jul 12 13:08:36 PDT 2021


Well, mystery partially solved.

If I configure libfabric 1.13.0 with an explicit --disable-psm3 (it was 
being enabled by default) then performance is back to normal.

I'll have to dig into the code a little more, but I guess I was somehow 
getting the psm3 provider activated at run time instead of psm2?
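
One way to test that guess without rebuilding might be to restrict 
provider selection at run time through the FI_PROVIDER environment 
variable, which libfabric reads as a comma-separated provider filter (a 
leading ^ negates the list).  The Python sketch below only builds the 
environment for a child process; the helper names and the ./pt2pt_bench 
command mentioned in the comment are hypothetical, not part of libfabric 
or of our test suite.

```python
import os


def provider_env(provider="psm2", base_env=None):
    """Return a copy of the environment that restricts libfabric to one provider."""
    env = dict(os.environ if base_env is None else base_env)
    # FI_PROVIDER is libfabric's provider filter variable.
    env["FI_PROVIDER"] = provider
    return env


def exclusion_env(excluded=("psm3",), base_env=None):
    """Return a copy of the environment that excludes the listed providers."""
    env = dict(os.environ if base_env is None else base_env)
    # A leading '^' negates the filter, so every listed provider is skipped.
    env["FI_PROVIDER"] = "^" + ",".join(excluded)
    return env


# Hypothetical usage (the benchmark binary name is an assumption):
#   import subprocess
#   subprocess.run(["./pt2pt_bench"], env=provider_env("psm2"), check=True)
```

If the numbers recover on the default build with FI_PROVIDER=psm2 (or 
with ^psm3), that would support the idea that psm3 was being selected at 
run time.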

-Phil

On 7/12/21 2:17 PM, Phil Carns via Libfabric-users wrote:
> Hi all,
>
> I recently tried upgrading from libfabric 1.11.1 to 1.13.0 for some 
> builds on a Linux cluster equipped with Haswell processors and an 
> OmniPath interconnect.  It's using opa-psm2 11.2.185 and the psm2 
> provider in libfabric.
>
> The performance dropped immensely after this upgrade, though.  We run 
> nightly performance regression tests (just simple point to point 
> in-house benchmarks).  The bandwidth went from a peak of around 11,800 
> MiB/s down to 1,300 MiB/s, and the latency went from around 4 usec to 
> 18 usec.  These results are very consistent.
>
> The libfabric 1.11.1 -> 1.13.0 update didn't alter performance in any 
> meaningful way on the other platforms we run nightly tests on 
> (including verbs, tcp, sockets, and gni providers). This seems to be 
> something peculiar to psm2.
>
> Anyone have any theories what might be going wrong here?  Maybe 
> something that needs to be configured differently?
>
> I can work on narrowing it down of course, but I thought I would ask 
> here for obvious problems before burning time on it.
>
> thanks!
>
> -Phil
>
>
> _______________________________________________
> Libfabric-users mailing list
> Libfabric-users at lists.openfabrics.org
> https://lists.openfabrics.org/mailman/listinfo/libfabric-users

