[ofiwg] Trying to understand how to use the auth_key field.

Thu Jun 17 06:59:14 PDT 2021

> I’ve been trying to figure out the best way to manage job/auth keys in a libfabric
> provider. PSM2 appears to require the key to be passed in as an environment variable –
> but will override that value if one is provided in the domain or fabric auth_key
> fields. (I think?)

My understanding is that the psm2 library does not define an API for the application to pass in a job key, which forces the use of an environment variable.  So a provider restricted to the psm2 API requires the environment variable to be set.  A more native OFI provider, without that restriction, can take the key through the OFI API auth_key field instead.  The psm2 provider evolved from being restricted by the psm2 API to being a slightly more native provider.
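As a rough sketch of the precedence described in the question (an explicitly supplied auth_key wins, the environment variable is the fallback), something like the following could work.  This is not the psm2 provider's code.  The variable name EXAMPLE_JOB_KEY, the 16-byte limit, and both function names are made up for illustration, and the environment value is treated as raw bytes rather than parsed.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names and sizes, for illustration only. */
#define JOB_KEY_ENV "EXAMPLE_JOB_KEY"
#define JOB_KEY_MAX 16

/*
 * Choose the job key to use.  A key passed in through auth_key takes
 * precedence; otherwise fall back to the environment value.
 * Returns the number of bytes written to out, or 0 if no usable key
 * was found (none supplied, or the candidate exceeds JOB_KEY_MAX).
 */
size_t resolve_job_key(const uint8_t *auth_key, size_t auth_key_size,
                       const char *env_value, uint8_t out[JOB_KEY_MAX])
{
    /* Native path: the application supplied a key through the API. */
    if (auth_key != NULL && auth_key_size > 0) {
        if (auth_key_size > JOB_KEY_MAX)
            return 0;
        memcpy(out, auth_key, auth_key_size);
        return auth_key_size;
    }

    /* Legacy path: only the environment variable is available. */
    if (env_value != NULL && env_value[0] != '\0') {
        size_t n = strlen(env_value);
        if (n > JOB_KEY_MAX)
            return 0;
        memcpy(out, env_value, n);
        return n;
    }

    return 0;
}

/* Convenience wrapper that reads the (hypothetical) variable itself. */
size_t resolve_job_key_from_env(const uint8_t *auth_key, size_t auth_key_size,
                                uint8_t out[JOB_KEY_MAX])
{
    return resolve_job_key(auth_key, auth_key_size, getenv(JOB_KEY_ENV), out);
}
```

Passing the environment value in as a parameter keeps the precedence logic separate from the process environment, which makes it easy to exercise on its own.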

> That said, I’m not sure how those fields are supposed to be generated. Reviewing other
> providers, it looks like it’s possible for a provider to generate the auth_key but I
> don’t see how that would be globally unique across the fabric and the only provider
> that seems to do this is gni.

The auth_key should really come from some other entity, like a job or fabric manager.  Ideally, some privileged agent verifies that a process has permission to use the auth_key that it is attempting to use.

Basically, there's a whole other control flow here that's outside the scope of libfabric.  An illustrative flow could be:

1. A central entity allocates a set of keys X & Y for a job.
2. A job manager starts up the ranks.
3. Job manager passes keys X & Y to each process.
4. Process PID P allocates an endpoint with key X.
5. Kernel agent contacts job manager to see if PID P can use X.
6. Job manager replies yes.
7. Kernel agent says, okay, cool.  Creates EP, programs X into HW.
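
To make the flow above a bit more concrete, here is a toy, single-process model of it.  None of this is libfabric API.  The structs, the function names, the 32-bit key type, and the fixed-size grant table are all assumptions made for the sketch.  A real job manager and kernel agent would be separate privileged components talking over some channel.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_GRANTS 8

/* One (pid, key) pair that the job manager has authorized. */
struct grant {
    int pid;
    uint32_t key;
};

/* Hypothetical job manager state: which process may use which key. */
struct job_manager {
    struct grant grants[MAX_GRANTS];
    size_t count;
};

/* Steps 1-3: record that process `pid` was handed `key` at launch. */
bool jm_grant(struct job_manager *jm, int pid, uint32_t key)
{
    if (jm->count == MAX_GRANTS)
        return false;
    jm->grants[jm->count].pid = pid;
    jm->grants[jm->count].key = key;
    jm->count++;
    return true;
}

/* Steps 5-6: answer the kernel agent's "can PID P use key X?" query. */
bool jm_may_use(const struct job_manager *jm, int pid, uint32_t key)
{
    for (size_t i = 0; i < jm->count; i++) {
        if (jm->grants[i].pid == pid && jm->grants[i].key == key)
            return true;
    }
    return false;
}

/* Hypothetical endpoint as seen by the kernel agent. */
struct endpoint {
    int owner_pid;
    uint32_t hw_key;   /* key "programmed into HW" */
    bool active;
};

/* Steps 4 and 7: create the endpoint only if the job manager approves. */
bool agent_create_ep(const struct job_manager *jm, int pid, uint32_t key,
                     struct endpoint *ep)
{
    if (!jm_may_use(jm, pid, key))
        return false;           /* ep is left untouched on refusal */
    ep->owner_pid = pid;
    ep->hw_key = key;
    ep->active = true;
    return true;
}
```

With keys X and Y granted to PID P, agent_create_ep() succeeds for P with X and fails for any other PID presenting X, which is the property steps 5-6 are there to provide.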

If a process is using a single key, this could probably be handled completely outside of libfabric.  For example, the kernel agent could request what key to use.  If the key is complex enough that a rogue app wouldn't be able to guess at a valid key, it may be safe to skip steps 5-6.  But I think keys are usually fairly small.  A reason for having multiple keys might be to separate compute from storage traffic.
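
On the key size point, a quick back-of-the-envelope helper shows why small keys make skipping the permission check risky.  It assumes a single valid key, a uniformly random key space, and a rogue app that tries distinct values.  The function name and any bit widths plugged into it are illustrative, not taken from a specific provider.

```c
#include <stdint.h>

/*
 * Probability that `tries` distinct guesses hit the one valid key in a
 * key space of 2^key_bits values.  Capped at 1.0 once the whole space
 * has been enumerated.
 */
double guess_probability(unsigned key_bits, uint64_t tries)
{
    double space = 1.0;

    /* Compute 2^key_bits as a double without needing libm. */
    for (unsigned i = 0; i < key_bits; i++)
        space *= 2.0;

    double p = (double)tries / space;
    return p > 1.0 ? 1.0 : p;
}
```

For an assumed 16-bit key, 32768 guesses already give a 50% chance of a hit, while the same number of guesses against an assumed 64-bit key is negligible.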

- Sean