OFA_OpenFabrics_Manager Open Fabric Manager Framework---Potential Use-Case ideas
Aguilar, Michael J.
mjaguil at sandia.gov
Fri Aug 21 10:10:40 PDT 2020
Everyone
We spent the meeting discussing potential Use-Cases for the Open Fabric Manager Framework. In the next couple of weeks, a few of us have committed to researching some of the items on the list below.
At the bottom, after the list, I have included a sample Use-Case description to work from.
Mike
Russ will look into Kubernetes and Composing (Orchestration?) services; Jeff will look into Aggregators; Mike will look into Slurm; and Forest will look into Neutron.
Use-Cases
1 Workload Management Tools
a. Slurm
b. PBS/Torque/Moab/Maui
c. Sun Grid Engine
d. Kubernetes
i. Who are the clients? How do we make things happen in containers?
e. OpenStack
f. Neutron
2 Operation Management Tools
a. Dashboards
i. Nagios
ii. Splunk
b. Samplers
i. Ganglia
ii. Nagios
iii. LDMS
c. Security
d. Telemetry
3 Orchestration tools
a. Cloud
i. Azure
ii. Amazon Web Services
b. HPC
4 Storage Managers
5 Parallel Programming
a. PGAS
b. SHMEM
c. MPI
i. OpenMPI
ii. MPICH
iii. MVAPICH
d. Libfabric and UCX
6 SNMP and Redfish
7 Aggregators
a. Odim
8 PowerAPI
Example Use-Case Description
Use Case Description
Slurm Create Run Queues
Actors
Fabric Manager, Administrator, Compute nodes, Fabric Endpoints, Ethernet Network
Description
Build run queues for Users to run jobs on
Normal Flow
· Create cluster name
· Open Ethernet communication ports (in/out) for subservient compute nodes
· Notify system to use Munge tunnels for fast communications and creation of run-time shells.
· Notate place for Slurm logging and maximum job logging
· Communicate with compute nodes to validate their running state: idle, running, drained, or dead
· Create and notate a cgroup to limit the resources of a group of compute nodes
· Notate Prolog and Epilog scripts to be used in every Slurm job allocation
· Notate Fabric endpoint and compute node hardware health check programs
· Provide a timeout for jobs that fail to execute (5 minutes)
· Segregate nodes by like features into queues
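As a starting point for discussion, the last step of the Normal Flow (together with the node-state check) might be sketched as follows. This is only an illustration, not a proposed implementation: the node names, states, and feature tags are made up, the choice to treat only idle and running nodes as usable is an assumption, and the helper names are hypothetical. The output lines follow the PartitionName/Nodes/State form used in slurm.conf.

```python
from collections import defaultdict

# Hypothetical node records; names, states, and features are illustrative only.
NODES = [
    {"name": "node01", "state": "idle", "features": ("gpu", "ib")},
    {"name": "node02", "state": "running", "features": ("ib", "gpu")},
    {"name": "node03", "state": "drained", "features": ("gpu", "ib")},
    {"name": "node04", "state": "idle", "features": ("ib",)},
    {"name": "node05", "state": "dead", "features": ("ib",)},
]

# Assumption: drained and dead nodes are left out of the run queues.
USABLE_STATES = {"idle", "running"}


def group_by_features(nodes):
    """Group usable nodes by their (order-independent) feature set."""
    queues = defaultdict(list)
    for node in nodes:
        if node["state"] not in USABLE_STATES:
            continue
        key = tuple(sorted(node["features"]))
        queues[key].append(node["name"])
    return dict(queues)


def render_partitions(queues):
    """Render one slurm.conf-style partition line per feature group."""
    lines = []
    for features, names in sorted(queues.items()):
        partition = "_".join(features) or "default"
        node_list = ",".join(sorted(names))
        lines.append(f"PartitionName={partition} Nodes={node_list} State=UP")
    return lines
```

Calling render_partitions(group_by_features(NODES)) on the sample data gives one queue for the gpu+ib nodes and one for the ib-only node. How a Fabric Manager would actually report endpoint features is left open here.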
Alternate Flow 1
· Create cluster name
· Open Ethernet communication ports (in/out) for subservient compute nodes
· Communications ports are busy, error out
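The error path in Alternate Flow 1 could be sketched roughly like this. It is a minimal illustration under stated assumptions: the port numbers are Slurm's default slurmctld and slurmd ports (a site may configure others), the check is a plain TCP bind attempt, and the function names are hypothetical.

```python
import socket

# Slurm's default ports; a site may configure different values.
SLURMCTLD_PORT = 6817
SLURMD_PORT = 6818


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Bind failed: the port is in use or otherwise unavailable.
            return False
    return True


def open_ports_or_fail(ports, host="0.0.0.0"):
    """Return the port list if all are free; otherwise error out."""
    busy = [p for p in ports if not port_is_free(p, host)]
    if busy:
        raise RuntimeError(f"communication ports busy: {busy}")
    return list(ports)
```

A caller would run open_ports_or_fail([SLURMCTLD_PORT, SLURMD_PORT]) right after creating the cluster name and stop the flow if it raises.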