OFA_OpenFabrics_Manager Open Fabric Manager Framework---Potential Use-Case ideas

Fri Aug 21 10:10:40 PDT 2020

Everyone

We spent the meeting discussing potential Use-Cases for the Open Fabric Manager Framework.  In the next couple of weeks, a few of us have committed to researching some of the items on the list below.

At the bottom, after the list, I have included a sample Use-Case description to work from.

Mike

Russ will look into Kubernetes and Composing (Orchestration?) services. Jeff will look into Aggregators. Mike will look into Slurm. Forest will look into Neutron.

Use-Cases

1   Workload Management Tools
    a.  Slurm
    b.  PBS/Torque/Moab/Maui
    c.  Sun Grid Engine
    d.  Kubernetes
        i.  Who are the clients?  How do we make things happen in containers?
    e.  OpenStack
    f.  Neutron

2   Operation Management Tools
    a.  Dashboards
        i.  Nagios
        ii. Splunk
    b.  Samplers
        i.  Ganglia
        ii. Nagios
        iii. LDMS
    c.  Security
    d.  Telemetry

3   Orchestration tools
    a.  Cloud
        i.  Azure
        ii. Amazon Web Services
    b.  HPC

4   Storage Managers

5   Parallel Programming
    a.  PGAS
    b.  SHMEM
    c.  MPI
        i.  OpenMPI
        ii. MPICH
        iii. MVAPICH
    d.  Libfabric and UCX

6   SNMP and Redfish

7   Aggregators
    a.  ODIM

8   PowerAPI


Example Use-Case Description

Use Case Description

Slurm Create Run Queues

Actors

Fabric Manager, Administrator, Compute nodes, Fabric Endpoints, Ethernet Network

Description

Build run queues for Users to run jobs on

Normal Flow

·        Create cluster name
·        Open Ethernet communication ports (in/out) for subservient compute nodes
·        Notify system to use Munge tunnels for fast communications and creation of run-time shells
·        Notate place for Slurm logging and maximum job logging
·        Communicate with compute nodes to validate their running state: idle, running, drained, or dead
·        Create and notate a cgroup to resource-limit a group of compute nodes
·        Notate Prolog and Epilog scripts to be used in every Slurm job allocation
·        Notate Fabric endpoint and compute node hardware health check programs
·        Provide a timeout (5 minutes) for jobs that fail to execute
·        Segregate nodes by like features into queues
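
As a starting point for discussion, the state-validation and "segregate nodes by like features" steps above could be sketched roughly as follows. This is only an illustration, not Slurm's own API: the `Node` class, the `build_queues` function, the node names and the feature labels are all hypothetical. Leaving drained and dead nodes out of the queues is a simplifying assumption made for the sketch.

```python
from collections import defaultdict
from dataclasses import dataclass

# Node states named in the Normal Flow above.
VALID_STATES = ("idle", "running", "drained", "dead")


@dataclass(frozen=True)
class Node:
    """Hypothetical record for one compute node (not a Slurm data structure)."""
    name: str
    state: str
    features: frozenset


def build_queues(nodes):
    """Group usable nodes into queues keyed by their sorted feature set."""
    queues = defaultdict(list)
    for node in nodes:
        if node.state not in VALID_STATES:
            raise ValueError(f"unknown state {node.state!r} for {node.name}")
        # Assumption: drained and dead nodes are skipped rather than queued.
        if node.state in ("drained", "dead"):
            continue
        key = ",".join(sorted(node.features)) or "default"
        queues[key].append(node.name)
    return dict(queues)


# Hypothetical inventory, as the Fabric Manager might report it.
nodes = [
    Node("n01", "idle", frozenset({"gpu", "ib"})),
    Node("n02", "running", frozenset({"ib", "gpu"})),
    Node("n03", "drained", frozenset({"gpu", "ib"})),
    Node("n04", "idle", frozenset({"ib"})),
    Node("n05", "dead", frozenset()),
    Node("n06", "idle", frozenset()),
]
queues = build_queues(nodes)
```

Keying each queue on the sorted feature set means that nodes listing the same features in a different order still land in the same queue.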

Alternate Flow 1

·        Create cluster name
·        Open Ethernet communication ports (in/out) for subservient compute nodes
·        If the communication ports are busy, error out
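
The alternate flow above could be sketched along these lines, assuming a port counts as "busy" when a TCP bind on it fails. The function names `port_is_free` and `open_cluster_ports` are hypothetical and are not part of Slurm or of any fabric manager. A real implementation would keep the ports open rather than just probing them.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
        return True


def open_cluster_ports(cluster_name, ports, host="127.0.0.1"):
    """Check the requested ports; error out if any of them is busy."""
    busy = [p for p in ports if not port_is_free(p, host)]
    if busy:
        raise RuntimeError(f"{cluster_name}: communication ports busy: {busy}")
    return list(ports)
```

A caller would pass the cluster name created in the first step together with the ports it intends to use, and treat the `RuntimeError` as the "error out" branch.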
