Description

This module creates a Slurm controller node via the slurm-gcp slurm_controller_instance and slurm_instance_template modules.

More information about Slurm on GCP can be found at the project's GitHub page and in the Slurm on Google Cloud User Guide.

The user guide provides detailed instructions on customizing and enhancing the Slurm on GCP cluster as well as recommendations on configuring the controller for optimal performance at different scales.

Example

- id: slurm_controller
  source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
  use:
  - network
  - homefs
  - compute_partition
  settings:
    machine_type: c2-standard-8

This creates a controller node with the following attributes:

  • connected to the primary subnetwork of network
  • the filesystem with the ID homefs (defined elsewhere in the blueprint) mounted
  • one partition with the ID compute_partition (defined elsewhere in the blueprint) attached
  • machine type upgraded from the default c2-standard-4 to c2-standard-8
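
For reference, a minimal sketch of how the compute_partition and its nodeset might be defined elsewhere in the blueprint; the module IDs, node count, and partition name are illustrative assumptions:

  - id: compute_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network]
    settings:
      node_count_dynamic_max: 20  # illustrative node count

  - id: compute_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use: [compute_nodeset]
    settings:
      partition_name: compute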

Live Cluster Reconfiguration

The schedmd-slurm-gcp-v6-controller module supports reconfiguration of partitions and of the Slurm configuration in a running, active cluster.

To reconfigure a running cluster:

  1. Edit the blueprint with the desired configuration changes
  2. Call gcluster create <blueprint> -w to overwrite the deployment directory
  3. Follow the instructions in the terminal to deploy

The following are examples of updates that can be made to a running cluster:

  • Add or remove a partition
  • Resize an existing partition
  • Attach new network storage to an existing partition
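
For example, resizing an existing partition is just an edit to its nodeset's node counts before re-running gcluster create <blueprint> -w; the values below are illustrative:

  - id: compute_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network]
    settings:
      node_count_dynamic_max: 50  # illustrative increase; applied when the cluster is redeployed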

NOTE: Changing the VM machine_type of a partition may not work. It is better to create a new partition and delete the old one.

Custom Images

For more information on creating valid custom images for the controller VM instance or for custom instance templates, see our vm-images.md documentation page.
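
As a sketch of how a custom image can be selected (the image family and project below are placeholders), set instance_image on the controller and acknowledge the custom image with instance_image_custom:

  - id: slurm_controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
    use: [network, compute_partition]
    settings:
      instance_image:
        family: my-slurm-custom-image  # placeholder image family
        project: my-project-id         # placeholder project
      instance_image_custom: true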

GPU Support

More information on GPU support in Slurm on GCP and other Cluster Toolkit modules can be found at docs/gpu-support.md
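
GPUs are more commonly attached to compute nodesets, but the controller module also accepts a guest_accelerator input (see the Inputs table below). A hedged sketch, where the machine type and accelerator type are placeholders:

  - id: slurm_controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
    use: [network, compute_partition]
    settings:
      machine_type: n1-standard-8    # placeholder machine type
      guest_accelerator:
      - type: nvidia-tesla-t4        # placeholder accelerator type
        count: 1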

Reservation for Scheduled Maintenance

A maintenance event occurs when Compute Engine stops a VM to perform a hardware or software update, as determined by the host maintenance policy. Running jobs can be disrupted if maintenance begins while they are active. The Cluster Toolkit lets you protect jobs from being terminated by maintenance: enable the creation of maintenance reservations for your compute nodeset, and Slurm will reserve the affected nodes during the maintenance window. Slurm will not schedule any job that would overlap with the maintenance reservation.

To enable creation of maintenance reservations, configure the compute nodeset in your blueprint as follows:

  - id: compute_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network]
    settings:
      enable_maintenance_reservation: true

When running a job on the Slurm cluster, you can specify its total run time with the -t flag. Slurm will then only run the job outside of the maintenance window:

srun -n1 -pcompute -t 10:00 <job.sh>

Upcoming maintenance notifications are currently supported only in the alpha version of the Compute API. You can update the API version used in your blueprint:

  - id: slurm_controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
    settings:
      endpoint_versions:
        compute: "alpha"

Opportunistic GCP maintenance in Slurm

You can also run GCP maintenance opportunistically as a Slurm job in order to perform maintenance early. When a node is flagged for upcoming maintenance, Slurm creates a job to perform the maintenance and places it in the job queue.

If the backfill scheduler is used, Slurm will backfill the maintenance job into any empty time window it can find.

You can also choose the builtin scheduler type, in which case Slurm runs the maintenance job in strict priority order. If the maintenance job does not get a chance to run early, forced maintenance takes place during the scheduled window.

You can enable this feature at the nodeset level as follows:

  - id: debug_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network]
    settings:
      enable_opportunistic_maintenance: true

Placement Max Distance

When using enable_placement with Slurm, Google Compute Engine will attempt to place VMs as physically close together as possible. Capacity constraints at the time of VM creation may still force VMs to be spread across multiple racks. Google provides the max-distance flag, which can be used to control the maximum spreading allowed. Read more about max-distance in the official docs.

You can use the enable_slurm_gcp_plugins.max_hops.max_hops setting on the controller module to control the max-distance behavior. See the following example:

  - id: controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
    use: [ network, partition ]
    settings:
      enable_slurm_gcp_plugins:
        max_hops:
          max_hops: 1

Note

schedmd-slurm-gcp-v6-nodeset.settings.enable_placement: true must also be set for max-distance to take effect.
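
For clarity, a sketch of a nodeset with placement enabled (the nodeset ID is illustrative):

  - id: compute_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network]
    settings:
      enable_placement: true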

In the max_hops example above, a value of 1 restricts the VMs to being placed on the same rack. You can confirm that max-distance was applied by running the following command while jobs are running:

gcloud beta compute resource-policies list \
  --format='yaml(name,groupPlacementPolicy.maxDistance)'

Warning

If a zone lacks capacity, using a lower max-distance value (such as 1) makes VM creation more likely to fail.

TreeWidth and Node Communication

Slurm uses a fan-out mechanism to communicate with large groups of nodes. The shape of this fan-out tree is determined by the TreeWidth configuration variable.

In the cloud, this fan-out mechanism can become unstable when nodes restart with new IP addresses. You can force all nodes to communicate directly with the controller by setting TreeWidth to a value greater than or equal to the size of the largest partition.

If the largest partition had 200 nodes, configure the blueprint as follows:

  - id: slurm_controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
    ...
    settings:
      cloud_parameters:
        tree_width: 200

The default has been set to 128. Values above this have not been fully tested and may cause congestion on the controller. A more scalable solution is under way.

Hybrid Slurm Clusters

For more information on how to configure an on-premises Slurm cluster with hybrid cloud partitions, see the schedmd-slurm-gcp-v5-hybrid module and our extended instructions in our docs.

Support

The Cluster Toolkit team maintains the wrapper around the slurm-on-gcp Terraform modules. For support with the underlying modules, see the instructions in the slurm-gcp README.

License

Copyright 2023 Google LLC

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Requirements

Name Version
terraform >= 1.3
google >= 4.84

Providers

Name Version
google >= 4.84

Modules

Name Source Version
bucket terraform-google-modules/cloud-storage/google ~> 6.1
daos_network_storage_scripts ../../../../modules/scripts/startup-script n/a
nodeset_cleanup ./modules/cleanup_compute n/a
nodeset_cleanup_tpu ./modules/cleanup_tpu n/a
slurm_controller_template github.com/GoogleCloudPlatform/slurm-gcp.git//terraform/slurm_cluster/modules/slurm_instance_template 6.8.6
slurm_files ./modules/slurm_files n/a
slurm_login_instance github.com/GoogleCloudPlatform/slurm-gcp.git//terraform/slurm_cluster/modules/_slurm_instance 6.8.6
slurm_login_template github.com/GoogleCloudPlatform/slurm-gcp.git//terraform/slurm_cluster/modules/slurm_instance_template 6.8.6
slurm_nodeset_template github.com/GoogleCloudPlatform/slurm-gcp.git//terraform/slurm_cluster/modules/slurm_instance_template 6.8.6
slurm_nodeset_tpu github.com/GoogleCloudPlatform/slurm-gcp.git//terraform/slurm_cluster/modules/slurm_nodeset_tpu 6.8.6

Resources

Name Type
google_compute_instance_from_template.controller resource
google_secret_manager_secret.cloudsql resource
google_secret_manager_secret_iam_member.cloudsql_secret_accessor resource
google_secret_manager_secret_version.cloudsql_version resource
google_storage_bucket_iam_binding.legacy_readers resource
google_storage_bucket_iam_binding.viewers resource
google_compute_image.slurm data source
google_project.this data source

Inputs

Name Description Type Default Required
additional_disks List of maps of disks.
list(object({
disk_name = string
device_name = string
disk_type = string
disk_size_gb = number
disk_labels = map(string)
auto_delete = bool
boot = bool
}))
[] no
allow_automatic_updates If false, disables automatic system package updates on the created instances. This feature is
only available on supported images (or images derived from them). For more details, see
https://cloud.google.com/compute/docs/instances/create-hpc-vm#disable_automatic_updates
bool true no
bandwidth_tier Configures the network interface card and the maximum egress bandwidth for VMs.
- Setting platform_default respects the Google Cloud Platform API default values for networking.
- Setting virtio_enabled explicitly selects the VirtioNet network adapter.
- Setting gvnic_enabled selects the gVNIC network adapter (without Tier 1 high bandwidth).
- Setting tier_1_enabled selects both the gVNIC adapter and Tier 1 high bandwidth networking.
- Note: both gVNIC and Tier 1 networking require a VM image with gVNIC support as well as specific VM families and shapes.
- See official docs for more details.
string "platform_default" no
bucket_dir Bucket directory for cluster files to be put into. If not specified, then one will be chosen based on slurm_cluster_name. string null no
bucket_name Name of GCS bucket.
Ignored when 'create_bucket' is true.
string null no
can_ip_forward Enable IP forwarding, for NAT instances for example. bool false no
cgroup_conf_tpl Slurm cgroup.conf template file path. string null no
cloud_parameters cloud.conf options. Defaults inherited from Slurm GCP repo
object({
no_comma_params = optional(bool, false)
private_data = optional(list(string))
scheduler_parameters = optional(list(string))
resume_rate = optional(number)
resume_timeout = optional(number)
suspend_rate = optional(number)
suspend_timeout = optional(number)
topology_plugin = optional(string)
topology_param = optional(string)
tree_width = optional(number)
})
{} no
cloudsql Use this database instead of the one on the controller.
server_ip : Address of the database server.
user : The user to access the database as.
password : The password, given the user, to access the given database. (sensitive)
db_name : The database to access.
user_managed_replication : The list of location and (optional) kms_key_name for secret
object({
server_ip = string
user = string
password = string # sensitive
db_name = string
user_managed_replication = optional(list(object({
location = string
kms_key_name = optional(string)
})), [])
})
null no
compute_startup_script Startup script used by the compute VMs. string "# no-op" no
compute_startup_scripts_timeout The timeout (seconds) applied to each script in compute_startup_scripts. If
any script exceeds this timeout, then the instance setup process is considered
failed and handled accordingly.

NOTE: When set to 0, the timeout is considered infinite and thus disabled.
number 300 no
controller_startup_script Startup script used by the controller VM. string "# no-op" no
controller_startup_scripts_timeout The timeout (seconds) applied to each script in controller_startup_scripts. If
any script exceeds this timeout, then the instance setup process is considered
failed and handled accordingly.

NOTE: When set to 0, the timeout is considered infinite and thus disabled.
number 300 no
create_bucket Create GCS bucket instead of using an existing one. bool true no
deployment_name Name of the deployment. string n/a yes
disable_controller_public_ips DEPRECATED: Use enable_controller_public_ips instead. bool null no
disable_default_mounts DEPRECATED: Use enable_default_mounts instead. bool null no
disable_smt DEPRECATED: Use enable_smt instead. bool null no
disk_auto_delete Whether or not the boot disk should be auto-deleted. bool true no
disk_labels Labels specific to the boot disk. These will be merged with var.labels. map(string) {} no
disk_size_gb Boot disk size in GB. number 50 no
disk_type Boot disk type, can be either hyperdisk-balanced, pd-ssd, pd-standard, pd-balanced, or pd-extreme. string "pd-ssd" no
enable_bigquery_load Enables loading of cluster job usage into BigQuery.

NOTE: Requires the Google BigQuery API.
bool false no
enable_cleanup_compute Enables automatic cleanup of compute nodes and resource policies (e.g.
placement groups) managed by this module, when cluster is destroyed.

WARNING: Toggling this off will impact the running workload.
Deployed compute nodes will be destroyed.
bool true no
enable_confidential_vm Enable the Confidential VM configuration. Note: the instance image must support this option. bool false no
enable_controller_public_ips If set to true, the controller will have a random public IP assigned to it. Ignored if access_config is set. bool false no
enable_debug_logging Enables debug logging mode. bool false no
enable_default_mounts Enable default global network storage from the controller
- /home
- /apps
Warning: If these are disabled, the slurm etc and munge dirs must be added
manually, or some other mechanism must be used to synchronize the slurm conf
files and the munge key across the cluster.
bool true no
enable_devel DEPRECATED: enable_devel is always on. bool null no
enable_external_prolog_epilog Automatically enable a script that will execute prolog and epilog scripts
shared by NFS from the controller to compute nodes. Find more details at:
https://github.com/GoogleCloudPlatform/slurm-gcp/blob/master/tools/prologs-epilogs/README.md
bool null no
enable_oslogin Enables Google Cloud os-login for user login and authentication for VMs.
See https://cloud.google.com/compute/docs/oslogin
bool true no
enable_shielded_vm Enable the Shielded VM configuration. Note: the instance image must support this option. bool false no
enable_slurm_gcp_plugins Enables calling hooks in scripts/slurm_gcp_plugins during cluster resume and suspend. any false no
enable_smt Enables Simultaneous Multi-Threading (SMT) on instance. bool false no
endpoint_versions Version of the API to use (The compute service is the only API currently supported)
object({
compute = string
})
{
"compute": "beta"
}
no
epilog_scripts List of scripts to be used for Epilog. Programs for the slurmd to execute
on every node when a user's job completes.
See https://slurm.schedmd.com/slurm.conf.html#OPT_Epilog.
list(object({
filename = string
content = optional(string)
source = optional(string)
}))
[] no
extra_logging_flags The only available flag is trace_api map(bool) {} no
gcloud_path_override Directory of the gcloud executable to be used during cleanup string "" no
guest_accelerator List of the type and count of accelerator cards attached to the instance.
list(object({
type = string,
count = number
}))
[] no
instance_image Defines the image that will be used in the Slurm controller VM instance.

Expected Fields:
name: The name of the image. Mutually exclusive with family.
family: The image family to use. Mutually exclusive with name.
project: The project where the image is hosted.

For more information on creating custom images that comply with Slurm on GCP
see the "Slurm on GCP Custom Images" section in docs/vm-images.md.
map(string)
{
"family": "slurm-gcp-6-8-hpc-rocky-linux-8",
"project": "schedmd-slurm-public"
}
no
instance_image_custom A flag that designates that the user is aware that they are requesting
to use a custom and potentially incompatible image for this Slurm on
GCP module.

If the field is set to false, only the compatible families and project
names will be accepted. The deployment will fail with any other image
family or name. If set to true, no checks will be done.

See: https://goo.gle/hpc-slurm-images
bool false no
instance_template DEPRECATED: Instance template can not be specified for controller. string null no
labels Labels, provided as a map. map(string) {} no
login_network_storage An array of network attached storage mounts to be configured on all login nodes.
list(object({
server_ip = string,
remote_mount = string,
local_mount = string,
fs_type = string,
mount_options = string,
}))
[] no
login_nodes List of slurm login instance definitions.
list(object({
name_prefix = string
access_config = optional(list(object({
nat_ip = string
network_tier = string
})))
additional_disks = optional(list(object({
disk_name = optional(string)
device_name = optional(string)
disk_size_gb = optional(number)
disk_type = optional(string)
disk_labels = optional(map(string), {})
auto_delete = optional(bool, true)
boot = optional(bool, false)
})), [])
additional_networks = optional(list(object({
access_config = optional(list(object({
nat_ip = string
network_tier = string
})), [])
alias_ip_range = optional(list(object({
ip_cidr_range = string
subnetwork_range_name = string
})), [])
ipv6_access_config = optional(list(object({
network_tier = string
})), [])
network = optional(string)
network_ip = optional(string, "")
nic_type = optional(string)
queue_count = optional(number)
stack_type = optional(string)
subnetwork = optional(string)
subnetwork_project = optional(string)
})), [])
bandwidth_tier = optional(string, "platform_default")
can_ip_forward = optional(bool, false)
disable_smt = optional(bool, false)
disk_auto_delete = optional(bool, true)
disk_labels = optional(map(string), {})
disk_size_gb = optional(number)
disk_type = optional(string, "n1-standard-1")
enable_confidential_vm = optional(bool, false)
enable_oslogin = optional(bool, true)
enable_shielded_vm = optional(bool, false)
gpu = optional(object({
count = number
type = string
}))
labels = optional(map(string), {})
machine_type = optional(string)
metadata = optional(map(string), {})
min_cpu_platform = optional(string)
num_instances = optional(number, 1)
on_host_maintenance = optional(string)
preemptible = optional(bool, false)
region = optional(string)
service_account = optional(object({
email = optional(string)
scopes = optional(list(string), ["https://www.googleapis.com/auth/cloud-platform"])
}))
shielded_instance_config = optional(object({
enable_integrity_monitoring = optional(bool, true)
enable_secure_boot = optional(bool, true)
enable_vtpm = optional(bool, true)
}))
source_image_family = optional(string)
source_image_project = optional(string)
source_image = optional(string)
static_ips = optional(list(string), [])
subnetwork = string
spot = optional(bool, false)
tags = optional(list(string), [])
zone = optional(string)
termination_action = optional(string)
}))
[] no
login_startup_script Startup script used by the login VMs. string "# no-op" no
login_startup_scripts_timeout The timeout (seconds) applied to each script in login_startup_scripts. If
any script exceeds this timeout, then the instance setup process is considered
failed and handled accordingly.

NOTE: When set to 0, the timeout is considered infinite and thus disabled.
number 300 no
machine_type Machine type to create. string "c2-standard-4" no
metadata Metadata, provided as a map. map(string) {} no
min_cpu_platform Specifies a minimum CPU platform. Applicable values are the friendly names of
CPU platforms, such as Intel Haswell or Intel Skylake. See the complete list:
https://cloud.google.com/compute/docs/instances/specify-min-cpu-platform
string null no
network_storage An array of network attached storage mounts to be configured on all instances.
list(object({
server_ip = string,
remote_mount = string,
local_mount = string,
fs_type = string,
mount_options = string,
client_install_runner = optional(map(string))
mount_runner = optional(map(string))
}))
[] no
nodeset Define nodesets, as a list.
list(object({
node_count_static = optional(number, 0)
node_count_dynamic_max = optional(number, 1)
node_conf = optional(map(string), {})
nodeset_name = string
additional_disks = optional(list(object({
disk_name = optional(string)
device_name = optional(string)
disk_size_gb = optional(number)
disk_type = optional(string)
disk_labels = optional(map(string), {})
auto_delete = optional(bool, true)
boot = optional(bool, false)
})), [])
bandwidth_tier = optional(string, "platform_default")
can_ip_forward = optional(bool, false)
disable_smt = optional(bool, false)
disk_auto_delete = optional(bool, true)
disk_labels = optional(map(string), {})
disk_size_gb = optional(number)
disk_type = optional(string)
enable_confidential_vm = optional(bool, false)
enable_placement = optional(bool, false)
enable_oslogin = optional(bool, true)
enable_shielded_vm = optional(bool, false)
enable_maintenance_reservation = optional(bool, false)
enable_opportunistic_maintenance = optional(bool, false)
gpu = optional(object({
count = number
type = string
}))
dws_flex = object({
enabled = bool
max_run_duration = number
use_job_duration = bool
})
labels = optional(map(string), {})
machine_type = optional(string)
maintenance_interval = optional(string)
instance_properties_json = string
metadata = optional(map(string), {})
min_cpu_platform = optional(string)
network_tier = optional(string, "STANDARD")
network_storage = optional(list(object({
server_ip = string
remote_mount = string
local_mount = string
fs_type = string
mount_options = string
client_install_runner = optional(map(string))
mount_runner = optional(map(string))
})), [])
on_host_maintenance = optional(string)
preemptible = optional(bool, false)
region = optional(string)
service_account = optional(object({
email = optional(string)
scopes = optional(list(string), ["https://www.googleapis.com/auth/cloud-platform"])
}))
shielded_instance_config = optional(object({
enable_integrity_monitoring = optional(bool, true)
enable_secure_boot = optional(bool, true)
enable_vtpm = optional(bool, true)
}))
source_image_family = optional(string)
source_image_project = optional(string)
source_image = optional(string)
subnetwork_self_link = string
additional_networks = optional(list(object({
network = string
subnetwork = string
subnetwork_project = string
network_ip = string
nic_type = string
stack_type = string
queue_count = number
access_config = list(object({
nat_ip = string
network_tier = string
}))
ipv6_access_config = list(object({
network_tier = string
}))
alias_ip_range = list(object({
ip_cidr_range = string
subnetwork_range_name = string
}))
})))
access_config = optional(list(object({
nat_ip = string
network_tier = string
})))
spot = optional(bool, false)
tags = optional(list(string), [])
termination_action = optional(string)
reservation_name = optional(string)
startup_script = optional(list(object({
filename = string
content = string })), [])

zone_target_shape = string
zone_policy_allow = set(string)
zone_policy_deny = set(string)
}))
[] no
nodeset_dyn Defines dynamic nodesets, as a list.
list(object({
nodeset_name = string
nodeset_feature = string
}))
[] no
nodeset_tpu Define TPU nodesets, as a list.
list(object({
node_count_static = optional(number, 0)
node_count_dynamic_max = optional(number, 5)
nodeset_name = string
enable_public_ip = optional(bool, false)
node_type = string
accelerator_config = optional(object({
topology = string
version = string
}), {
topology = ""
version = ""
})
tf_version = string
preemptible = optional(bool, false)
preserve_tpu = optional(bool, false)
zone = string
data_disks = optional(list(string), [])
docker_image = optional(string, "")
network_storage = optional(list(object({
server_ip = string
remote_mount = string
local_mount = string
fs_type = string
mount_options = string
client_install_runner = optional(map(string))
mount_runner = optional(map(string))
})), [])
subnetwork = string
service_account = optional(object({
email = optional(string)
scopes = optional(list(string), ["https://www.googleapis.com/auth/cloud-platform"])
}))
project_id = string
reserved = optional(string, false)
}))
[] no
on_host_maintenance Instance availability policy. string "MIGRATE" no
partitions Cluster partitions as a list. See module slurm_partition.
list(object({
partition_name = string
partition_conf = optional(map(string), {})
partition_nodeset = optional(list(string), [])
partition_nodeset_dyn = optional(list(string), [])
partition_nodeset_tpu = optional(list(string), [])
enable_job_exclusive = optional(bool, false)
}))
n/a yes
preemptible Allow the instance to be preempted. bool false no
project_id Project ID to create resources in. string n/a yes
prolog_scripts List of scripts to be used for Prolog. Programs for the slurmd to execute
whenever it is asked to run a job step from a new job allocation.
See https://slurm.schedmd.com/slurm.conf.html#OPT_Prolog.
list(object({
filename = string
content = optional(string)
source = optional(string)
}))
[] no
region The default region to place resources in. string n/a yes
service_account DEPRECATED: Use service_account_email and service_account_scopes instead.
object({
email = string
scopes = set(string)
})
null no
service_account_email Service account e-mail address to attach to the controller instance. string null no
service_account_scopes Scopes to attach to the controller instance. set(string)
[
"https://www.googleapis.com/auth/cloud-platform"
]
no
shielded_instance_config Shielded VM configuration for the instance. Note: not used unless
enable_shielded_vm is 'true'.
enable_integrity_monitoring : Compare the most recent boot measurements to the
integrity policy baseline and return a pair of pass/fail results depending on
whether they match or not.
enable_secure_boot : Verify the digital signature of all boot components, and
halt the boot process if signature verification fails.
enable_vtpm : Use a virtualized trusted platform module, which is a
specialized computer chip you can use to encrypt objects like keys and
certificates.
object({
enable_integrity_monitoring = bool
enable_secure_boot = bool
enable_vtpm = bool
})
{
"enable_integrity_monitoring": true,
"enable_secure_boot": true,
"enable_vtpm": true
}
no
slurm_cluster_name Cluster name, used for resource naming and slurm accounting.
If not provided it will default to the first 8 characters of the deployment name (removing any invalid characters).
string null no
slurm_conf_tpl Slurm slurm.conf template file path. string null no
slurmdbd_conf_tpl Slurm slurmdbd.conf template file path. string null no
static_ips List of static IPs for VM instances. list(string) [] no
subnetwork_self_link Subnet to deploy to. string n/a yes
tags Network tag list. list(string) [] no
universe_domain Domain address for alternate API universe string "googleapis.com" no
zone Zone where the instances should be created. If not specified, instances will be
spread across available zones in the region.
string null no

Outputs

Name Description
instructions Post deployment instructions.
slurm_bucket_path Bucket path used by cluster.
slurm_cluster_name Slurm cluster name.
slurm_controller_instance Compute instance of controller node.
slurm_login_instances Compute instances of login nodes.