[Ofmfwg] OpenFabrics Management Framework Working Group results for 4 September--Slurm Spank Batch Use-Case to Fabric Manager Framework

Aguilar, Michael J. mjaguil at sandia.gov
Fri Sep 4 10:21:53 PDT 2020


Everyone

Today, we discussed the features and interactions of a potential Slurm Spank plug-in with a new Fabric Manager Framework.  A key feature would be support for ‘mini-fabrics’, in which endpoints are dynamically correlated with physical, virtual, and container Slurm assignments.  Potential benefits would include:


1)     Support for server farms using VMs and containers.

2)     Power reduction in future HPC clusters.

I will post the current Slurm Batch Use-Case description to the Working Group site repository so that everyone can review and comment on it over the weekend.  We plan to resume on Friday 11 September.

Mike


Use Case Description

Slurm Create Batch Run---Spank Plug-in

Actors

Proprietary Fabric Manager, Administrator, Compute nodes, Fabric Endpoints, Ethernet Network

Description

Build out Nodes and topology for an HPC Batch Run

Normal Flow

·        Batch run submitted to Slurm
·        Init Spank plugin
·        Spank extension sets up Allocator context
·        Set up logging: use the logging capabilities from the fabric provider.
·        Get the node list values assigned by Slurm to run the batch job (node_list array)
·        Correlate the compute node_list (physical or virtual) with fabric endpoints: physical, fully virtual, paravirtual, or container
·        Get service key from list of path choices between the endpoints
·        Provide the restrictions and features that limit the batch job
·        Get back necessary security keys, etc. from the fabric

Slurm requirements in / Fabric Manager responses out: information is exchanged in both directions

·        Run batch job
o   You are running on this physical mapping, identifier and port on the fabric.
·        Spank-Fini
·        Exit job
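
The normal flow above could be sketched roughly as follows. This is only an illustration in Python, not the SPANK C API and not a proposed interface: every name here (FabricManagerStub, Endpoint, run_batch, the key formats) is hypothetical, and the stub stands in for whatever the Fabric Manager Framework ends up exposing.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical fabric endpoint backing one Slurm-assigned node."""
    node: str
    kind: str   # "physical", "virtual", or "container"
    port: int


@dataclass
class FabricManagerStub:
    """Stand-in for the fabric manager; an assumption, not a real API."""
    endpoints: dict                           # node name -> Endpoint
    log: list = field(default_factory=list)   # stands in for fabric-provider logging

    def correlate(self, node_list):
        # Map each Slurm node to its fabric endpoint (KeyError if unknown).
        return [self.endpoints[n] for n in node_list]

    def service_key(self, eps):
        # Placeholder for picking a service key from the path choices.
        return "svc-" + "-".join(str(e.port) for e in eps)

    def security_keys(self, eps, restrictions):
        # Placeholder: one opaque key per endpoint; restrictions are unused here.
        return {e.node: f"key-{e.node}" for e in eps}


def run_batch(fm, node_list, restrictions, job):
    """Walk the normal flow: init, correlate, get keys, run, fini."""
    fm.log.append("spank init")                  # init plug-in, context, logging
    eps = fm.correlate(node_list)                # node_list -> fabric endpoints
    svc = fm.service_key(eps)                    # service key for the chosen paths
    keys = fm.security_keys(eps, restrictions)   # security keys back from the fabric
    result = job(eps, svc, keys)                 # run the batch job
    fm.log.append("spank fini")                  # Spank-Fini, then exit
    return result
```

The job callable receives the endpoints, service key, and security keys, which corresponds to the job being told which mapping, identifier, and port it is running on.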

Alternate Flow 1

·        Batch run submitted to Slurm
·        Init Spank plugin
·        Spank extension sets up Allocator context
·        Set up logging for connection
·        Get node list values (node_list array)
·        Correlate node_list with endpoints
·        Provide routes to nodes
·        Spank detects error
·        Log error
·        Report error
·        Exit job
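
As a rough sketch of the error path in Alternate Flow 1, again in Python and under stated assumptions: FabricError, correlate, run_with_error_handling, and the logger name are all hypothetical, and a missing endpoint is just one example of an error the plug-in might detect.

```python
import logging

log = logging.getLogger("spank_sketch")   # hypothetical logger name


class FabricError(Exception):
    """Hypothetical error raised when a fabric setup step fails."""


def correlate(node_list, endpoints):
    # Map each node to an endpoint; fail if any node has none.
    missing = [n for n in node_list if n not in endpoints]
    if missing:
        raise FabricError(f"no endpoint for nodes: {missing}")
    return [endpoints[n] for n in node_list]


def run_with_error_handling(node_list, endpoints):
    """Return (exit_code, message); a non-zero code means the job exits early."""
    try:
        eps = correlate(node_list, endpoints)
    except FabricError as err:
        log.error("fabric setup failed: %s", err)   # log error
        return 1, str(err)                          # report error, exit job
    return 0, f"routes provided to {len(eps)} endpoints"
```

Returning an exit code and message keeps the log, report, and exit steps separate, so the caller decides how the error is reported back to Slurm.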


