[ofiwg] Trying to understand how to use the auth_key field.

Heinz, Michael William michael.william.heinz at cornelisnetworks.com
Thu Jun 17 07:45:57 PDT 2021


Thanks for the reply, Sean.

I agree that the auth_key needs to come from something at a higher level. I've been experimenting with Intel MPI, though, and I can't figure out how to get it to generate one: the auth_key fields in the domain and ep attributes are null when I see them. I've ended up using a shell variable passed in on the mpirun command line, but I feel like that should be the fallback rather than the only solution.
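
A minimal sketch of that fallback order, in C, may make the intent concrete. It is not taken from any provider: struct key_attr is a stand-in for the auth_key / auth_key_size pair in the domain and ep attributes, EXAMPLE_JOB_AUTH_KEY is a made-up variable name, and both the "ep, then domain, then environment" precedence and the use of the variable's text as raw key bytes are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the auth_key / auth_key_size fields of the domain and
 * endpoint attributes. This is NOT the real libfabric struct. */
struct key_attr {
    const uint8_t *auth_key;
    size_t auth_key_size;
};

/* Hypothetical environment variable name, for illustration only. */
#define JOB_AUTH_KEY_ENV "EXAMPLE_JOB_AUTH_KEY"

/* Copy the first available key into out, trying the endpoint attribute,
 * then the domain attribute, then the environment (assumed order).
 * Returns the key length in bytes, or 0 if no key was found or the key
 * does not fit in out_len bytes. */
static size_t resolve_auth_key(const struct key_attr *ep,
                               const struct key_attr *dom,
                               uint8_t *out, size_t out_len)
{
    const struct key_attr *src[2] = { ep, dom };

    assert(out != NULL);

    for (int i = 0; i < 2; i++) {
        if (src[i] && src[i]->auth_key && src[i]->auth_key_size > 0) {
            if (src[i]->auth_key_size > out_len)
                return 0;
            memcpy(out, src[i]->auth_key, src[i]->auth_key_size);
            return src[i]->auth_key_size;
        }
    }

    /* Fallback: the launcher exported the key as a shell variable.
     * Its text is used as raw bytes here (an assumption). */
    const char *env = getenv(JOB_AUTH_KEY_ENV);
    if (env && *env) {
        size_t n = strlen(env);
        if (n > out_len)
            return 0;
        memcpy(out, env, n);
        return n;
    }
    return 0;
}
```

With both attribute keys null, as I see under Intel MPI, a lookup like this would only succeed when the variable is exported on the mpirun command line.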

-----Original Message-----
From: Hefty, Sean <sean.hefty at intel.com> 
Sent: Thursday, June 17, 2021 9:59 AM
To: Heinz, Michael William <michael.william.heinz at cornelisnetworks.com>; ofiwg at lists.openfabrics.org
Subject: RE: Trying to understand how to use the auth_key field.

> I've been trying to figure out the best way to manage job/auth keys in 
> a libfabric provider. PSM2 appears to require the key to be passed in 
> as an environment variable - but will override that value if one is 
> provided in the domain or fabric auth_key fields. (I think?)

My understanding is that the psm2 library does not define an API for the application to pass in a job key, which forces the use of an environment variable.  So a provider restricted to the psm2 API needs the environment variable to be set.  A more native OFI provider, without that restriction, can use the OFI API auth_key instead.  The psm2 provider evolved from being restricted by the psm2 API to being slightly more native.


> That said, I'm not sure how those fields are supposed to be generated. 
> Reviewing other providers, it looks like it's possible for a provider 
> to generate the auth_key but I don't see how that would be globally 
> unique across the fabric and the only provider that seems to do this is gni.

The auth_key should really come from some other entity, like a job or fabric manager.  Ideally, some privileged agent verifies that a process has permission to use an auth_key that it is attempting to use.

Basically, there's a whole other control flow here that's outside the scope of libfabric.  An illustrative flow could be:

1. A central entity allocates a set of keys X & Y for a job.
2. A job manager starts up the ranks.
3. Job manager passes keys X & Y to each process.
4. Process PID P allocates an endpoint with key X.
5. Kernel agent contacts job manager to see if PID P can use X.
6. Job manager replies yes.
7. Kernel agent says, okay, cool.  Creates EP, programs X into HW.
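
As a rough sketch only, steps 4-7 could be modeled in C as below. Everything in it is hypothetical: struct job, struct endpoint, job_manager_may_use() and kernel_agent_create_ep() are invented names, the 32-bit key width is an assumption, and the job manager query is a plain function call standing in for whatever channel a real kernel agent would use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_KEYS 2
#define MAX_PIDS 4

/* Job manager's record of one job: the keys allocated to it (steps 1-3)
 * and the PIDs of the ranks it started (step 2). */
struct job {
    uint32_t keys[MAX_KEYS];
    size_t nkeys;
    int pids[MAX_PIDS];
    size_t npids;
};

/* Result of an endpoint request; valid is false if the request was denied. */
struct endpoint {
    int owner_pid;
    uint32_t hw_key;
    bool valid;
};

/* Steps 5-6: may this PID use this key?  Approved only if the PID is one
 * of the job's ranks and the key is one of the job's keys. */
static bool job_manager_may_use(const struct job *job, int pid, uint32_t key)
{
    bool pid_ok = false;
    bool key_ok = false;

    assert(job != NULL);

    for (size_t i = 0; i < job->npids; i++)
        if (job->pids[i] == pid)
            pid_ok = true;
    for (size_t i = 0; i < job->nkeys; i++)
        if (job->keys[i] == key)
            key_ok = true;
    return pid_ok && key_ok;
}

/* Steps 4 and 7: the process asks for an endpoint with a key; the agent
 * creates it only after the job manager approves. */
static struct endpoint kernel_agent_create_ep(const struct job *job,
                                              int pid, uint32_t key)
{
    struct endpoint ep = { pid, 0, false };

    if (job_manager_may_use(job, pid, key)) {
        ep.hw_key = key;    /* stands in for programming X into HW */
        ep.valid = true;
    }
    return ep;
}
```

For a job holding keys 0x11 and 0x22 with ranks 100 and 101, a request from PID 100 for key 0x11 yields a valid endpoint, while a request for an unallocated key, or from a PID outside the job, is refused.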

If a process is using a single key, this could probably be handled completely outside of libfabric.  For example, the kernel agent could request what key to use.  If the key is complex enough that a rogue app wouldn't be able to guess at a valid key, it may be safe to skip steps 5-6.  But I think keys are usually fairly small.  A reason for having multiple keys might be to separate compute from storage traffic.

- Sean

