[Ofmfwg] OpenFabrics Management Framework Working Group results for 4 September--Slurm Spank Batch Use-Case to Fabric Manager Framework
Aguilar, Michael J.
mjaguil at sandia.gov
Fri Sep 4 10:21:53 PDT 2020
Everyone
Today, we discussed the features and interactions of a potential Slurm Spank plug-in with a new Fabric Manager Framework. A key feature would be support for ‘mini-fabrics’, in which endpoints are dynamically correlated with physical, virtual, and container Slurm assignments. Potential benefits would include:
1) Support for server farms using VMs and containers.
2) Power reduction in future HPC clusters.
I will post this Slurm Batch Use-Case description to the Working Group site repository for everyone to review over the weekend. We plan to resume on Friday 11 September.
Mike
Use Case Description
Slurm Create Batch Run---Spank Plug-in
Actors
Proprietary Fabric Manager, Administrator, Compute nodes, Fabric Endpoints, Ethernet Network
Description
Build out Nodes and topology for an HPC Batch Run
Normal Flow
· Batch run submitted to Slurm
· Init Spank plugin
· Spank extension sets up Allocator context
· Set up logging: use the logging capabilities of the fabric provider.
· Get node list values assigned by Slurm to run the batch job (node_list array)
· Correlate compute node_list (physical or virtual) with fabric endpoints (physical, fully virtualized, paravirtualized, or container)
· Get service key from list of path choices between the endpoints
· Provide restrictions and features of batch limitations
· Get back necessary security keys, etc. from the fabric
o Slurm requirements in, Fabric Manager responses out: information is transferred back and forth
· Run batch job
o You are running on this physical mapping, identifier and port on the fabric.
· Spank-Fini
· Exit job
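The Normal Flow above can be sketched in a few lines of Python. This is only an illustration of the ordering of steps, not a Spank plug-in (a real plug-in would be written in C against the Slurm Spank interface). Every name here (Endpoint, FakeFabricManager, run_normal_flow, the key format, the restrictions dictionary) is a hypothetical stand-in and does not come from Slurm or from any fabric manager API.

```python
from dataclasses import dataclass


# Hypothetical type: not part of Slurm or any fabric manager API.
@dataclass(frozen=True)
class Endpoint:
    node: str   # node name as it appears in node_list
    kind: str   # "physical", "virtual", "paravirtual", or "container"
    port: int   # port identifier on the fabric


class FakeFabricManager:
    """In-memory stand-in; a real fabric manager would be a separate service."""

    def __init__(self, inventory):
        self._inventory = {ep.node: ep for ep in inventory}

    def lookup(self, node):
        # Raises KeyError if the node has no known endpoint.
        return self._inventory[node]

    def request_keys(self, endpoints, restrictions):
        # Placeholder: one opaque token per endpoint; restrictions are ignored here.
        return {ep.node: f"key-{ep.node}-{ep.port}" for ep in endpoints}


def run_normal_flow(node_list, fabric, restrictions, log):
    log("init: allocator context created")
    # Correlate each Slurm-assigned node with its fabric endpoint.
    endpoints = [fabric.lookup(node) for node in node_list]
    log(f"correlated {len(endpoints)} nodes with endpoints")
    # Send requirements in, get keys back.
    keys = fabric.request_keys(endpoints, restrictions)
    mapping = {ep.node: (ep.kind, ep.port, keys[ep.node]) for ep in endpoints}
    log("job would run here with the mapping above")
    log("fini: allocator context released")
    return mapping
```

The returned mapping corresponds to the "you are running on this physical mapping, identifier and port" message in the Run batch job step; how that information would actually reach the job is left open.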
Alternate Flow 1
· Batch run submitted to Slurm
· Init Spank plugin
· Spank extension sets up Allocator context
· Set up logging for connection
· Get node list values (node_list array)
· Correlate node_list with endpoints
· Provide routes to nodes
· Spank detects error
· Log error
· Report error
· Exit job
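Alternate Flow 1 can be sketched the same way. The example below assumes, purely for illustration, that the error Spank detects is a node with no matching endpoint; the use case itself does not say what the error is. FabricError, run_alternate_flow, and the return-code convention are hypothetical names, not part of Slurm or any fabric manager.

```python
class FabricError(Exception):
    """Hypothetical error type for a failed fabric setup step."""


def run_alternate_flow(node_list, endpoint_table, log):
    """Return (0, routes) on success or (1, message) after logging an error."""
    try:
        # Correlate node_list with endpoints; fail if any node is unknown.
        missing = [node for node in node_list if node not in endpoint_table]
        if missing:
            raise FabricError(f"no endpoint for nodes: {', '.join(missing)}")
        routes = {node: endpoint_table[node] for node in node_list}
        log("routes provided")
        return 0, routes
    except FabricError as err:
        # Log the error, report it to the caller, and exit the job.
        log(f"error: {err}")
        return 1, str(err)
```

Calling run_alternate_flow(["n1", "n9"], {"n1": "ep-1"}, print) would log the missing node and return a non-zero code, which mirrors the Log error, Report error, and Exit job steps.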