
Description

This module creates a compute partition that can be used as input to the schedmd-slurm-gcp-v6-controller module.

The partition module is designed to work alongside the schedmd-slurm-gcp-v6-nodeset module. A partition can be made up of one or more nodesets, provided either through `use` (preferred) or defined manually in the `nodeset` variable.

Example

The following code snippet creates a partition module with:

  • Two nodesets added via `use`:
    • The first nodeset is made up of machines of type c2-standard-30.
    • The second nodeset is made up of machines of type c2-standard-60.
    • Both nodesets allow a maximum of 200 dynamically created nodes.
  • A partition name of "compute".
  • A connection to the network module via `use`.
  • Nodes mounted to homefs via `use`.
```yaml
- id: nodeset_1
  source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
  use:
  - network
  settings:
    name: c30
    node_count_dynamic_max: 200
    machine_type: c2-standard-30

- id: nodeset_2
  source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
  use:
  - network
  settings:
    name: c60
    node_count_dynamic_max: 200
    machine_type: c2-standard-60

- id: compute_partition
  source: community/modules/compute/schedmd-slurm-gcp-v6-partition
  use:
  - homefs
  - nodeset_1
  - nodeset_2
  settings:
    partition_name: compute
```
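
A partition only takes effect once it is passed to the controller. As a minimal sketch (the controller's other required settings are omitted here for brevity), the partition defined above could be consumed in the same blueprint via `use`:

```yaml
# Illustrative sketch: the partition module's output is consumed by the
# v6 controller module through `use`; other controller settings omitted.
- id: slurm_controller
  source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
  use:
  - network
  - compute_partition
```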

Support

The Cluster Toolkit team maintains the wrapper around the slurm-on-gcp Terraform modules. For support with the underlying modules, see the instructions in the slurm-gcp README.

Requirements

| Name | Version |
|------|---------|
| terraform | >= 1.3 |

Providers

No providers.

Modules

No modules.

Resources

No resources.

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| exclusive | Exclusive job access to nodes. When set to true, nodes execute a single job and are deleted<br>after the job exits. If set to false, multiple jobs can be scheduled on one node. | `bool` | `true` | no |
| is_default | Sets this partition as the default partition by updating the partition_conf.<br>If "Default" is already set in partition_conf, this variable will have no effect. | `bool` | `false` | no |
| network_storage | DEPRECATED | <pre>list(object({<br>  server_ip             = string,<br>  remote_mount          = string,<br>  local_mount           = string,<br>  fs_type               = string,<br>  mount_options         = string,<br>  client_install_runner = map(string)<br>  mount_runner          = map(string)<br>}))</pre> | `[]` | no |
| nodeset | A list of nodesets.<br>For type definition see community/modules/scheduler/schedmd-slurm-gcp-v6-controller/variables.tf::nodeset | `list(any)` | `[]` | no |
| nodeset_dyn | Defines dynamic nodesets, as a list. | <pre>list(object({<br>  nodeset_name    = string<br>  nodeset_feature = string<br>}))</pre> | `[]` | no |
| nodeset_tpu | Define TPU nodesets, as a list. | <pre>list(object({<br>  node_count_static      = optional(number, 0)<br>  node_count_dynamic_max = optional(number, 5)<br>  nodeset_name           = string<br>  enable_public_ip       = optional(bool, false)<br>  node_type              = string<br>  accelerator_config = optional(object({<br>    topology = string<br>    version  = string<br>  }), {<br>    topology = ""<br>    version  = ""<br>  })<br>  tf_version   = string<br>  preemptible  = optional(bool, false)<br>  preserve_tpu = optional(bool, false)<br>  zone         = string<br>  data_disks   = optional(list(string), [])<br>  docker_image = optional(string, "")<br>  network_storage = optional(list(object({<br>    server_ip     = string<br>    remote_mount  = string<br>    local_mount   = string<br>    fs_type       = string<br>    mount_options = string<br>  })), [])<br>  subnetwork = string<br>  service_account = optional(object({<br>    email  = optional(string)<br>    scopes = optional(list(string), ["https://www.googleapis.com/auth/cloud-platform"])<br>  }))<br>  project_id = string<br>  reserved   = optional(string, false)<br>}))</pre> | `[]` | no |
| partition_conf | Slurm partition configuration as a map.<br>See https://slurm.schedmd.com/slurm.conf.html#SECTION_PARTITION-CONFIGURATION | `map(string)` | `{}` | no |
| partition_name | The name of the slurm partition. | `string` | n/a | yes |
| resume_timeout | Maximum time permitted (in seconds) between when a node resume request is issued and when the node is actually available for use.<br>If null is given, then a smart default will be chosen depending on nodesets in partition.<br>This sets 'ResumeTimeout' in partition_conf.<br>See https://slurm.schedmd.com/slurm.conf.html#OPT_ResumeTimeout_1 for details. | `number` | `300` | no |
| suspend_time | Nodes which remain idle or down for this number of seconds will be placed into power save mode by SuspendProgram.<br>This sets 'SuspendTime' in partition_conf.<br>See https://slurm.schedmd.com/slurm.conf.html#OPT_SuspendTime_1 for details.<br>NOTE: use value -1 to exclude partition from suspend.<br>NOTE 2: if var.exclusive is set to true (default), nodes are deleted immediately after the job finishes. | `number` | `300` | no |
| suspend_timeout | Maximum time permitted (in seconds) between when a node suspend request is issued and when the node is shut down.<br>If null is given, then a smart default will be chosen depending on nodesets in partition.<br>This sets 'SuspendTimeout' in partition_conf.<br>See https://slurm.schedmd.com/slurm.conf.html#OPT_SuspendTimeout_1 for details. | `number` | `null` | no |
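
Several of these inputs are commonly overridden together. The following sketch (values are illustrative, not recommendations) shows a partition that is the cluster default, allows multiple jobs per node, is excluded from automatic power-down, and passes an extra Slurm option through `partition_conf`:

```yaml
# Illustrative settings only; see the table above for defaults and semantics.
- id: debug_partition
  source: community/modules/compute/schedmd-slurm-gcp-v6-partition
  use:
  - nodeset_1
  settings:
    partition_name: debug
    is_default: true      # make this the default partition
    exclusive: false      # allow multiple jobs to share a node
    suspend_time: -1      # exclude this partition from suspend
    partition_conf:
      PriorityTier: "10"  # any slurm.conf partition option, as strings
```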

Outputs

| Name | Description |
|------|-------------|
| nodeset | Details of the nodesets in this partition |
| nodeset_dyn | Details of the dynamic nodesets in this partition |
| nodeset_tpu | Details of the TPU nodesets in this partition |
| partitions | Details of the Slurm partition |