Move examples/hpc-slurm to V6 (#2097)

* Move `examples/hpc-slurm` to V6;
* Update `examples/README`;
* Remove `slurm-v5-hpc-centos7` test.
mr0re1 authored Jan 10, 2024
1 parent e9728b0 commit a75c5b5
Showing 7 changed files with 37 additions and 255 deletions.
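For anyone picking up the updated example, a minimal deployment sketch — assuming the Toolkit's `ghpc` binary is built from this checkout, and that the `create`/`deploy` subcommands and `--vars` flag behave as in Toolkit releases of this period (check `./ghpc --help` for your version):

```shell
# Build the ghpc binary, then render and deploy the V6 example blueprint
make
./ghpc create examples/hpc-slurm.yaml \
  --vars project_id=$(gcloud config get-value project)
./ghpc deploy hpc-slurm  # folder name follows deployment_name in the blueprint
```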
87 changes: 0 additions & 87 deletions community/examples/hpc-slurm6.yaml

This file was deleted.

3 changes: 0 additions & 3 deletions community/modules/scheduler/schedmd-slurm-gcp-v5-login/README.md

```diff
@@ -28,9 +28,6 @@ This creates a Slurm login node which is:
   `use`
 * of VM machine type `n2-standard-4`
 
-For a complete example using this module, see
-[hpc-slurm.yaml](../../../../examples/hpc-slurm.yaml).
-
 ## Custom Images
 
 For more information on creating valid custom images for the login node VM
```
36 changes: 2 additions & 34 deletions examples/README.md
```diff
@@ -13,7 +13,6 @@ md_toc github examples/README.md | sed -e "s/\s-\s/ * /"
 * [Blueprint Descriptions](#blueprint-descriptions)
   * [hpc-slurm.yaml](#hpc-slurmyaml-) ![core-badge]
   * [hpc-enterprise-slurm.yaml](#hpc-enterprise-slurmyaml-) ![core-badge]
-  * [hpc-slurm6.yaml](#hpc-slurm6yaml--) ![community-badge] ![experimental-badge]
   * [hpc-slurm6-tpu.yaml](#hpc-slurm6-tpuyaml--) ![community-badge] ![experimental-badge]
   * [ml-slurm.yaml](#ml-slurmyaml-) ![core-badge]
   * [image-builder.yaml](#image-builderyaml-) ![core-badge]
@@ -118,13 +117,11 @@ the experimental badge (![experimental-badge]).
 
 ### [hpc-slurm.yaml] ![core-badge]
 
-> **Warning**: The variables `enable_reconfigure`,
-> `enable_cleanup_compute`, and `enable_cleanup_subscriptions`, if set to
-> `true`, require additional dependencies **to be installed on the system deploying the infrastructure**.
+> **Warning**: Requires additional dependencies **to be installed on the system deploying the infrastructure**.
 >
 > ```shell
 > # Install Python3 and run
-> pip3 install -r https://raw.githubusercontent.com/SchedMD/slurm-gcp/5.9.1/scripts/requirements.txt
+> pip3 install -r https://raw.githubusercontent.com/GoogleCloudPlatform/slurm-gcp/6.2.1/scripts/requirements.txt
 > ```
 
 Creates a basic auto-scaling Slurm cluster with mostly default settings. The
@@ -265,35 +262,6 @@ to 256
 
 [hpc-enterprise-slurm.yaml]: ./hpc-enterprise-slurm.yaml
 
-### [hpc-slurm6.yaml] ![community-badge] ![experimental-badge]
-
-> **Warning**: Requires additional dependencies **to be installed on the system deploying the infrastructure**.
->
-> ```shell
-> # Install Python3 and run
-> pip3 install -r https://raw.githubusercontent.com/GoogleCloudPlatform/slurm-gcp/6.2.1/scripts/requirements.txt
-> ```
-
-Creates a basic auto-scaling Slurm cluster with mostly default settings. The
-blueprint also creates a new VPC network, and a filestore instance mounted to
-`/home`.
-
-There are 2 partitions in this example: `debug`, and `compute`. The `debug`
-partition uses `n2-standard-2` VMs, which should work out of the box without
-needing to request additional quota. The purpose of the `debug` partition is to
-make sure that first time users are not immediately blocked by quota
-limitations.
-
-[hpc-slurm6.yaml]: ../community/examples/hpc-slurm6.yaml
-
-#### Compute Partition
-
-There is a `compute` partition that achieves higher performance. Any
-performance analysis should be done on the `compute` partition. By default it
-uses `c2-standard-60` VMs with placement groups enabled. You may need to request
-additional quota for `C2 CPUs` in the region you are deploying in. You can
-select the compute partition using the `-p compute` argument when running `srun`.
-
 ### [hpc-slurm6-tpu.yaml] ![community-badge] ![experimental-badge]
 
 > **Warning**: Requires additional dependencies **to be installed on the system deploying the infrastructure**.
```
66 changes: 33 additions & 33 deletions examples/hpc-slurm.yaml
```diff
@@ -18,7 +18,7 @@ blueprint_name: hpc-slurm
 
 vars:
   project_id: ## Set GCP Project ID Here ##
-  deployment_name: hpc-small
+  deployment_name: hpc-slurm
   region: us-central1
   zone: us-central1-a
 
@@ -28,53 +28,54 @@ vars:
 deployment_groups:
 - group: primary
   modules:
-  # Source is an embedded resource, denoted by "resources/*" without ./, ../, /
-  # as a prefix. To refer to a local resource, prefix with ./, ../ or /
-  # Example - ./resources/network/vpc
-  - id: network1
+  # Source is an embedded module, denoted by "modules/*" without ./, ../, /
+  # as a prefix. To refer to a local module, prefix with ./, ../ or /
+  # Example - ./modules/network/vpc
+  - id: network
     source: modules/network/vpc
 
   - id: homefs
     source: modules/file-system/filestore
-    use: [network1]
+    use: [network]
     settings:
       local_mount: /home
 
-  - id: debug_node_group
-    source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
+  - id: debug_nodeset
+    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
+    use: [network]
     settings:
       node_count_dynamic_max: 4
       machine_type: n2-standard-2
+      enable_placement: false # the default is: true
 
   - id: debug_partition
-    source: community/modules/compute/schedmd-slurm-gcp-v5-partition
+    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
     use:
-    - network1
-    - homefs
-    - debug_node_group
+    - debug_nodeset
     settings:
       partition_name: debug
       exclusive: false # allows nodes to stay up after jobs are done
-      enable_placement: false # the default is: true
       is_default: true
 
-  - id: compute_node_group
-    source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
+  - id: compute_nodeset
+    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
+    use: [network]
     settings:
       node_count_dynamic_max: 20
       bandwidth_tier: gvnic_enabled
 
   - id: compute_partition
-    source: community/modules/compute/schedmd-slurm-gcp-v5-partition
+    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
     use:
-    - network1
-    - homefs
-    - compute_node_group
+    - compute_nodeset
     settings:
       partition_name: compute
 
-  - id: h3_node_group
-    source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
+  - id: h3_nodeset
+    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
+    use: [network]
     settings:
       node_count_dynamic_max: 20
       machine_type: h3-standard-88
@@ -84,30 +85,29 @@ deployment_groups:
       bandwidth_tier: gvnic_enabled
 
   - id: h3_partition
-    source: community/modules/compute/schedmd-slurm-gcp-v5-partition
+    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
     use:
-    - network1
-    - homefs
-    - h3_node_group
+    - h3_nodeset
     settings:
       partition_name: h3
 
+  - id: slurm_login
+    source: community/modules/scheduler/schedmd-slurm-gcp-v6-login
+    use: [network]
+    settings:
+      name_prefix: login
+      machine_type: n2-standard-4
+      disable_login_public_ips: false
+
   - id: slurm_controller
-    source: community/modules/scheduler/schedmd-slurm-gcp-v5-controller
+    source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
     use:
-    - network1
+    - network
     - debug_partition
     - compute_partition
     - h3_partition
     - homefs
+    - slurm_login
    settings:
       disable_controller_public_ips: false
-
-  - id: slurm_login
-    source: community/modules/scheduler/schedmd-slurm-gcp-v5-login
-    use:
-    - network1
-    - slurm_controller
-    settings:
-      machine_type: n2-standard-4
-      disable_login_public_ips: false
```
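Out of diff form, the shape of the change: in V5, partitions aggregated node groups along with the network and filestore, and the login node hung off the controller; in V6, nodesets attach to the network directly, partitions reference only nodesets, and the controller aggregates the network, partitions, filestore, and login node. A minimal sketch of the V6 chain, with module sources as in the diff above and illustrative IDs and settings:

```yaml
# Minimal V6 wiring sketch; IDs and settings are illustrative
- id: my_nodeset
  source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
  use: [network]       # nodesets attach to the network directly
  settings:
    node_count_dynamic_max: 4

- id: my_partition
  source: community/modules/compute/schedmd-slurm-gcp-v6-partition
  use: [my_nodeset]    # partitions reference nodesets only
  settings:
    partition_name: mypart

- id: slurm_controller
  source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
  use: [network, my_partition, homefs, slurm_login]  # controller aggregates the rest
```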
54 changes: 0 additions & 54 deletions tools/cloud-build/daily-tests/builds/slurm-gcp-v5-hpc-centos7.yaml

This file was deleted.

43 changes: 0 additions & 43 deletions tools/cloud-build/daily-tests/tests/slurm-v5-hpc-centos7.yml

This file was deleted.

3 changes: 2 additions & 1 deletion tools/cloud-build/daily-tests/tests/slurm-v6-rocky8.yml
```diff
@@ -26,13 +26,14 @@ cli_deployment_vars:
 
 zone: us-west4-c
 workspace: /workspace
-blueprint_yaml: "{{ workspace }}/community/examples/hpc-slurm6.yaml"
+blueprint_yaml: "{{ workspace }}/examples/hpc-slurm.yaml"
 network: "{{ deployment_name }}-net"
 max_nodes: 5
+# Note: Pattern matching in gcloud only supports 1 wildcard, a*-login-* won't work.
 login_node: "{{ slurm_cluster_name }}-login-*"
 controller_node: "{{ slurm_cluster_name }}-controller"
 post_deploy_tests:
 - test-validation/test-mounts.yml
 - test-validation/test-partitions.yml
 custom_vars:
   partitions:
```
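The new comment about wildcard matching deserves a concrete illustration — hedged, since the exact lookup command the test harness runs isn't shown in this hunk. gcloud list filters support a prefix match with a single trailing wildcard, which is why the login node is matched on `{{ slurm_cluster_name }}-login-*` rather than a two-wildcard glob:

```shell
# A single trailing wildcard in a name: prefix filter works (cluster name illustrative)
gcloud compute instances list --filter="name:myclust-login-*"
# A two-wildcard glob such as a*-login-* is not supported by gcloud filters
```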
