[ofiwg] [EXTERNAL] Re: Trying to understand how to use the auth_key field.
Latham, Robert J.
robl at mcs.anl.gov
Mon Jul 12 11:48:22 PDT 2021
On Thu, 2021-06-17 at 15:49 +0000, Pritchard Jr., Howard via ofiwg
wrote:
> Hi All,
>
> For the Cray Aries network the auth key is handled by two external
> widgets:
> 1. part of job launching procedure either with aprun or slurm, or
> 2. there's an RDMA credentials server an application can use -
> https://cug.org/proceedings/cug2016_proceedings/includes/files/pap108s2-file1.pdf
> I think mercury and some other libfabric consumers have used that.
This is getting a little far afield from libfabric, but Michael Heinz
might appreciate a more concrete example of a higher-level library
(mercury) using auth_key to manage RDMA credentials.
A service provider acquires a credential that allows processes outside
its own launch context (e.g. a separate aprun or srun) to communicate
with it:
https://github.com/mochi-hpc/mochi-ssg/blob/main/tests/ssg-launch-group-drc.c#L172
The provider shares that credential with the other provider processes
via MPI
https://github.com/mochi-hpc/mochi-ssg/blob/main/tests/ssg-launch-group-drc.c#L175
or PMIx
https://github.com/mochi-hpc/mochi-ssg/blob/main/tests/ssg-launch-group-drc.c#L203
We turn the 'credential' into a 'cookie'
https://github.com/mochi-hpc/mochi-ssg/blob/main/tests/ssg-launch-group-drc.c#L231
And stash that string-type cookie into Mercury's "auth_key"
https://github.com/mochi-hpc/mochi-ssg/blob/main/tests/ssg-launch-group-drc.c#L235
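The last two steps can be sketched roughly as follows. This is a minimal illustration, not the code at the links above: it assumes the cookie is a 32-bit unsigned integer, and `struct init_opts`, `cookie_to_auth_key`, and `set_auth_key_from_cookie` are hypothetical names standing in for the library's real init options, which expect the key as a string.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the init options handed to the RPC library.
 * Only the string-valued auth_key matters for this sketch. */
struct init_opts {
    const char *auth_key;
};

/* Render a 32-bit cookie (assumed width) as a decimal string.
 * Returns 0 on success, -1 if buf is too small. */
int cookie_to_auth_key(uint32_t cookie, char *buf, size_t len)
{
    int n = snprintf(buf, len, "%" PRIu32, cookie);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Format the cookie into the caller-owned buffer and point
 * opts->auth_key at it. The buffer must outlive opts. */
int set_auth_key_from_cookie(struct init_opts *opts, uint32_t cookie,
                             char *buf, size_t len)
{
    if (cookie_to_auth_key(cookie, buf, len) != 0)
        return -1;
    opts->auth_key = buf;
    return 0;
}
```

The caller owns the string buffer here, so it has to stay alive until the library has been initialized with those options.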
This provider saves a little blob of state, containing information such
as the network address of the provider and this credential. Clients of
this provider load up this blob, obtain the credential, and inform
mercury of the "auth_key" to use for communication:
https://github.com/mochi-hpc/mochi-ssg/blob/main/tests/ssg-observe-group-drc.c#L112
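To make the "little blob of state" idea concrete, here is a rough sketch of a provider encoding its address and credential and a client decoding them again. The one-line text format, `struct group_state`, `state_encode`, and `state_decode` are all made up for illustration and are not SSG's actual file format or API; the sketch also assumes the address string contains no whitespace.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define ADDR_MAX 128

/* Hypothetical state blob; not SSG's real on-disk format. */
struct group_state {
    char addr[ADDR_MAX];   /* provider's network address string */
    int64_t credential;    /* credential ID the provider acquired */
};

/* Provider side: serialize as "addr credential\n".
 * Returns 0 on success, -1 if the result does not fit in buf. */
int state_encode(const struct group_state *st, char *buf, size_t len)
{
    int n = snprintf(buf, len, "%s %" PRId64 "\n", st->addr, st->credential);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Client side: parse the blob back into a struct.
 * The %127s width keeps the address within ADDR_MAX including the NUL.
 * Returns 0 on success, -1 on malformed input. */
int state_decode(const char *buf, struct group_state *st)
{
    int got = sscanf(buf, "%127s %" SCNd64, st->addr, &st->credential);
    return got == 2 ? 0 : -1;
}
```

After decoding, the client would still need to exchange the credential for a cookie and pass that along as the "auth_key" string, as in the provider-side steps.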
Now that I write this all out it sounds kind of convoluted, but it
turns out to be more portable than relying on Cray "aprun" protection
domains.
==rob
> In both cases it's an external agent that is handling this.
>
> I believe for HPE Slingshot 11 there's a PMIx plugin that will do 1
> (not sure about that though)
>
> Howard
>
>
> On 6/17/21, 8:57 AM, "ofiwg on behalf of Hefty, Sean" <
> ofiwg-bounces at lists.openfabrics.org on behalf of sean.hefty at intel.com
> > wrote:
>
> > Thanks for the reply, Sean.
> >
> > I agree that the auth_key needs to come from something at a higher
> > level. I've been experimenting with Intel MPI, though, and I can't
> > figure out how to get it to generate one - the auth_key fields in
> > the domain and ep attributes are null when I see them. I've ended
> > up using a shell variable passed in on the mpirun command but I
> > feel like that should be the fallback rather than the only
> > solution.
>
> I don't know how Intel MPI handles job keys. But having MPI
> generate a key doesn't seem any better than libfabric generating one,
> unless you're including mpirun or the start-up as part of
> MPI. I'll forward your email separately to one of the MPI
> developers.
>
> - Sean
> _______________________________________________
> ofiwg mailing list
> ofiwg at lists.openfabrics.org
> https://lists.openfabrics.org/mailman/listinfo/ofiwg