Skip to content

ncsa/puppet-profile_slurm

Repository files navigation

profile_slurm

pdk-validate yamllint

NCSA Customizations for Slurm

Table of Contents

  1. Description
  2. Setup
  3. Usage
  4. Dependencies
  5. Reference

Description

This is a module to add NCSA customizations to a Slurm client/scheduler/monitor node. It does not do the initial install/config of Slurm, that is handled by other modules (see treydock/slurm and treydock/slurm_providers )

Currently this module:

  • Adds more fine grained controls for firewalls, so will want to set slurm::manage_firewall: false from treydock/slurm to false
  • Allows telegraf scripts to be deployed which collect Slurm metrics.

Setup

For a Slurm submit node (generally cluster login nodes):

include ::profile_slurm::client

For a Slurm compute node:

include ::profile_slurm::client
include ::profile_slurm::compute

For a Slurm scheduler:

include ::profile_slurm::scheduler

For a Slurm monitor node (Include this on the node that runs slurmdbd):

include ::profile_slurm::monitor

For a node with slurmrestd:

include ::profile_slurm::slurmrestd

NOTE: This has NOT been tested on a standalone node, only on a scheduler. NOTE: Having slurmrestd to listen on the internet is not a great idea, and it's very bad if there is no web proxy in front of it for additional security. Configuration of such a proxy is not covered by this profile.

Usage

You will want to set these hiera variables for Slurm submit nodes (generally cluster login nodes):

profile_slurm::client::firewall::sources:
  - "192.168.0.0/24"  # Allow access to slurmd from 192.168.0.0/24 net

You will generally want to manage the firewall as well as dependencies on local storage for Slurm compute nodes:

profile_slurm::client::firewall::sources:
  - "192.168.0.0/24"  # Allow access to slurmd from 192.168.0.0/24 net
profile_slurm::compute::dependencies:
  - "Gpfs::Bindmount['/scratch']"
...
  - "Gpfs::Nativemount['cluster']"
  - "Lvm::Logical_volume[local]"
  - "Profile_lustre::Nativemount_resource['/projects']"
...
profile_slurm::compute::storage::tmpfs_dir: "/local/slurmjobs"
profile_slurm::compute::storage::tmpfs_dir_refreshed_by: "Lvm::Logical_volume[local]"

You will want to set these hiera variables for scheduler nodes:

profile_slurm::scheduler::firewall::sources:
  - "192.168.0.0/24"  # Allow access to slurmdbd and slurmctld from 192.168.0.0/24 net
# and generally specify dependencies on local storage:
profile_slurm::scheduler::dependencies:
  - "Mount['/slurm']"
  - "Mount['/var/log/slurm']"
  - "Mount['/var/spool/slurmctld.state']"
# lower this timeout from default of 60 sec; this prevents Puppet runs from
# taking an excessive amount of time if the id_check fails (in which case
# slurmctld will fail to start but Puppet will try to contact slurmctld
# until this timeout is reached, prior to running 'scontrol reconfig')
slurm::slurmctld_conn_validator_timeout: 20

To set up slurmrestd on a scheduler, configure these Hiera variables:

# common.yaml:
## make slurmrestd dependent on slurmctld
profile_slurm::slurmrestd::dependencies:
  - "Service['slurmctld']"
## open up the firewall, if needed
profile_slurm::slurmrestd::firewall_sources:
  - "A.B.C.D/XX"  # cluster network
slurm::auth_alt_types:
  - auth/jwt
## optional, but suggested to explicitly set this
slurm::slurmrestd_disable_token_creation: false

# role/slurm_scheduler.yaml:
slurm::slurmrestd: true

# <encrypted and stored in eyaml>:
slurm::jwt_key_content: ...

# node/<scheduler_hostname>.yaml:
## suggested to restrict slurmrestd to the cluster network
slurm::slurmrestd_listen_address: 172.31.2.8

To install arbitrary files and symlinks, set this Hiera variable:

profile_slurm::files::files:
  ...

Telegraf Monitoring

You will want to set these hiera variables for a node running the telegraf monitoring:

profile_slurm::telegraf::telegraf::slurm_job_table: ""

If using MySQL with socket authentication, you'll also need to set the following because the telegraf user cannot use socket auth to connect as another user:

profile_slurm::telegraf::slurm_username: "telegraf"

If using MySQL without socket authentication, you'll also need to set the following to lookup the password:

profile_slurm::telegraf::telegraf::slurm_password: "%{lookup('slurm::slurmdbd_storage_pass')}"  # This is a VAULT lookup, use the keyname you have chosen for storing the slurmdb user account password

Dependencies

n/a

Reference

See: REFERENCE.md