[ofiwg] Trying to understand how to use the auth_key field.
Hefty, Sean
sean.hefty at intel.com
Thu Jun 17 06:59:14 PDT 2021
> I’ve been trying to figure out the best way to manage job/auth keys in a libfabric
> provider. PSM2 appears to require the key to be passed in as an environment variable –
> but will override that value if one is provided in the domain or fabric auth_key
> fields. (I think?)
My understanding is that the psm2 library does not define an API for the application to pass in a job key, which forces the use of an environment variable. So if the provider is restricted to using the psm2 API, the environment variable must be set. But with that restriction removed and a more native OFI provider, the OFI API auth_key can be used. The psm2 provider evolved from being restricted by the psm2 API to being a slightly more native provider.
> That said, I’m not sure how those fields are supposed to be generated. Reviewing other
> providers, it looks like it’s possible for a provider to generate the auth_key but I
> don’t see how that would be globally unique across the fabric and the only provider
> that seems to do this is gni.
The auth_key should really come from some other entity, like a job or fabric manager. Ideally, some privileged agent verifies that a process has permission to use an auth_key that it is attempting to use.
Basically, there's a whole other control flow here that's outside the scope of libfabric. An illustrative flow could be:
1. A central entity allocates a set of keys X & Y for a job.
2. A job manager starts up the ranks.
3. Job manager passes keys X & Y to each process.
4. Process PID P allocates an endpoint with key X.
5. Kernel agent contacts job manager to see if PID P can use X.
6. Job manager replies yes.
7. Kernel agent says, okay, cool. Creates EP, programs X into HW.
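To make the steps above concrete, here is a minimal Python sketch of that flow. Nothing in it is a libfabric or kernel interface: JobManager, KernelAgent, allocate_keys, and the 4-byte key size are all hypothetical names and values chosen for illustration, and the "hardware" is just a dictionary.

```python
import secrets
from dataclasses import dataclass, field


def allocate_keys(count, key_bytes=4):
    """Step 1: a central entity allocates `count` distinct keys for a job.

    The 4-byte default is an assumption for illustration only.
    """
    keys = set()
    while len(keys) < count:
        keys.add(secrets.token_bytes(key_bytes))
    return sorted(keys)


@dataclass
class JobManager:
    """Hypothetical job manager: hands keys to ranks and answers queries."""
    grants: dict = field(default_factory=dict)  # pid -> set of permitted keys

    def launch_rank(self, pid, keys):
        # Steps 2-3: start a rank and pass it the job's keys.
        self.grants[pid] = set(keys)
        return list(keys)

    def may_use(self, pid, key):
        # Step 6: answer the kernel agent's permission query.
        return key in self.grants.get(pid, set())


@dataclass
class KernelAgent:
    """Hypothetical privileged agent that creates endpoints."""
    job_manager: JobManager
    programmed: dict = field(default_factory=dict)  # ep_id -> (pid, key)

    def create_endpoint(self, pid, key):
        # Step 5: ask the job manager whether this PID may use the key.
        if not self.job_manager.may_use(pid, key):
            raise PermissionError(f"PID {pid} may not use this key")
        # Step 7: create the endpoint and record the key as "programmed".
        ep_id = len(self.programmed)
        self.programmed[ep_id] = (pid, key)
        return ep_id


if __name__ == "__main__":
    manager = JobManager()
    agent = KernelAgent(manager)
    key_x, key_y = allocate_keys(2)
    manager.launch_rank(pid=1234, keys=[key_x, key_y])
    # Step 4: the process asks for an endpoint with key X.
    ep = agent.create_endpoint(pid=1234, key=key_x)
    print("endpoint", ep, "created for PID 1234")
```

A process that was never handed the key (a PID the job manager does not know about) gets a PermissionError from create_endpoint, which is the point of the check in steps 5-6.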
If a process is using a single key, this could probably be handled completely outside of libfabric. For example, the kernel agent could request what key to use. If the key is complex enough that a rogue app wouldn't be able to guess at a valid key, it may be safe to skip steps 5-6. But I think keys are usually fairly small. A reason for having multiple keys might be to separate compute from storage traffic.
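As a rough way to reason about whether a key is "complex enough", the sketch below estimates the chance that a rogue app hits a valid key by guessing. It assumes independent, uniformly random guesses and that the app can tell when a guess succeeded. The function name and the key widths used in the loop are illustrative assumptions, not the key sizes of any particular provider or fabric.

```python
import math


def guess_success_probability(key_bits, valid_keys, attempts):
    """Probability that at least one of `attempts` random guesses is valid.

    Assumes each guess is drawn uniformly and independently from the
    2**key_bits possible keys, of which `valid_keys` are in use.
    """
    p_hit = valid_keys / 2 ** key_bits
    if p_hit >= 1.0:
        return 1.0
    # 1 - (1 - p_hit)**attempts, computed so that very small p_hit
    # values do not round away to zero.
    return -math.expm1(attempts * math.log1p(-p_hit))


if __name__ == "__main__":
    # Illustrative widths only; two valid keys, one million guesses.
    for bits in (16, 32, 64, 128):
        p = guess_success_probability(bits, valid_keys=2, attempts=10**6)
        print(f"{bits:3d}-bit key: {p:.3e}")
```

With small keys the probability approaches 1 quickly, which is why the verification in steps 5-6 matters when keys are short.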
- Sean