Release v1.6.0 #600

Merged
merged 86 commits into from
Oct 4, 2022
46dac6f
Address permadiff in vm-instance module
tpdownes Sep 12, 2022
e8848c1
Exposing enable_reconfigure in Slurm-onGCP V5
cboneti Sep 10, 2022
5fa294e
Merge branch 'develop' of github.com:GoogleCloudPlatform/hpc-toolkit …
cboneti Sep 12, 2022
a8e6d5f
Add customized version output
thiagosgobe Sep 8, 2022
19c57c2
handling detached HEAD scenarios
thiagosgobe Sep 9, 2022
f33ec26
Add git checks into Makefile
thiagosgobe Sep 9, 2022
5a84b91
Merge pull request #532 from thiagosgobe/debugging_improvement
nick-stroud Sep 12, 2022
4fd22e3
Run shell runners as executable
nick-stroud Sep 12, 2022
14cc4eb
Merge pull request #542 from tpdownes/fix_guest_accelerator
tpdownes Sep 12, 2022
a3f3beb
Merge pull request #537 from cboneti/reconfigure
cboneti Sep 12, 2022
a329d75
Adding Slurm on GCP V4 static nodes functionality
cboneti Sep 13, 2022
90b79c9
Merge pull request #544 from cboneti/v4-static-nodes
cboneti Sep 13, 2022
5455177
Set enable_smt default to false for slurm v5 modules
heyealex Sep 13, 2022
47b4c7c
Default scope now allows reading AND writing.
sandwichmaker Sep 13, 2022
02cbdbf
Merge pull request #545 from heyealex/smt-default-slurm-gcp-v5
heyealex Sep 13, 2022
0dca67a
remove "kind:" from examples and docs where optional
kkr16 Sep 13, 2022
1ff7ecd
Add interactive argument to qsim script to make conda accessable
nick-stroud Sep 13, 2022
75a28b0
Add troubleshooting for Slurm: network is unreachable
nick-stroud Sep 14, 2022
afa8f3c
Merge branch 'develop' into patch-1 - incorporate tflint fix
nick-stroud Sep 14, 2022
cc76315
Add auto-delete boot disk as an option on vm-instance
nick-stroud Sep 13, 2022
05d4a86
Merge pull request #543 from nick-stroud/shell_as_executable
nick-stroud Sep 14, 2022
8da491b
Fix README.md to match variables.tf
sandwichmaker Sep 14, 2022
ce40ada
Update go.mod and go.sum
sandwichmaker Sep 14, 2022
bbd7748
Default slurm_cluster_name to deploy name in hybrid
heyealex Sep 14, 2022
ea68256
Merge pull request #548 from nick-stroud/vm-instance_auto-delete-boot…
nick-stroud Sep 14, 2022
60fe706
Merge pull request #550 from heyealex/hybrid-slurm-no-cluster-name
heyealex Sep 14, 2022
1657e45
Merge pull request #546 from sandwichmaker/patch-1
nick-stroud Sep 14, 2022
ff76359
Upgrade DDN-EXAScaler to v6.1.0
nick-stroud Sep 15, 2022
f88ccc1
Add Epilog/Prolog scripts to install path in hybrid
heyealex Sep 15, 2022
fa971c8
Merge pull request #551 from nick-stroud/upgrade_ddn
nick-stroud Sep 15, 2022
925880b
Integrate DDN Lustre install script with startup-script
nick-stroud Sep 15, 2022
57ff26a
Merge pull request #549 from nick-stroud/document_slurm_requires_publ…
nick-stroud Sep 15, 2022
6140385
Make install path sed command more readable
heyealex Sep 15, 2022
8ae8c9c
Merge pull request #552 from heyealex/install-path-include-epilog-prolog
heyealex Sep 16, 2022
146ae5b
remove "kind:" from examples and docs where optional
kkr16 Sep 16, 2022
94a9a3a
remove "kind:" from examples and docs where optional
kkr16 Sep 16, 2022
8493619
Merge branch 'GoogleCloudPlatform:develop' into develop
kkr16 Sep 16, 2022
36da913
Bump cloud.google.com/go/compute from 1.9.0 to 1.10.0
dependabot[bot] Sep 15, 2022
93df5a3
Upgrade Cloud Storage Go module
tpdownes Sep 16, 2022
d5ceb3a
Warn users about deprecated 'name' argument for EXAScaler image
nick-stroud Sep 16, 2022
c44b7a4
Rename EXAScaler output to clarify it is a script
nick-stroud Sep 16, 2022
9c9b664
Merge pull request #553 from nick-stroud/ddn_lustre_install
nick-stroud Sep 16, 2022
83b8ed9
Merge branch 'develop' into develop
heyealex Sep 16, 2022
2ee9d43
Merge pull request #555 from nick-stroud/ddn_image_warning
nick-stroud Sep 16, 2022
8e07b23
Merge pull request #547 from kkr16/develop
heyealex Sep 16, 2022
31a2c66
Add all gcp hybrid slurm demo instructions
heyealex Sep 13, 2022
b5478d4
Add requirements file for pip dependencies
heyealex Sep 16, 2022
09f223b
Merge pull request #554 from GoogleCloudPlatform/dependabot/go_module…
tpdownes Sep 16, 2022
8676bcf
Address an idempotency in Spack install script
tpdownes Sep 16, 2022
e2988fa
Merge pull request #557 from tpdownes/fix_spack_setup
tpdownes Sep 16, 2022
f23dbf1
Eliminate 1 git checkout during Spack install
tpdownes Sep 16, 2022
75bd872
Replace Spack installation in AMD example with a builder VM
tpdownes Sep 16, 2022
c35ee99
Merge pull request #558 from tpdownes/amd_spack_builder
tpdownes Sep 16, 2022
66e94d4
Merge pull request #559 from tpdownes/spack_speedy
tpdownes Sep 16, 2022
dc01b1d
Enable ddn lustre client install with pre-existing-network-storage
nick-stroud Sep 16, 2022
670393e
Address dependency checker timeout failure
tpdownes Sep 19, 2022
b80212f
Merge pull request #561 from tpdownes/bugfix/dependency_checker_timeout
tpdownes Sep 19, 2022
95680b1
Add directory README to hybrid docs, update networks blueprint
heyealex Sep 19, 2022
c7b2bfa
Add install lustre from pre-existing-network-storage to integration t…
nick-stroud Sep 19, 2022
99bcc8f
Add retry loop for installing DDN client setup tool
nick-stroud Sep 19, 2022
1f6c7d7
Update pre-existing-network-storage output runners to have determinis…
nick-stroud Sep 19, 2022
905a2c1
Allow git:: as a valid source prefix
heyealex Sep 15, 2022
443149e
Add documentation and tests for git:: source
heyealex Sep 16, 2022
a802085
Change github source to general git
heyealex Sep 19, 2022
043b323
Ensure deployment group directory is created
heyealex Sep 19, 2022
f951d1d
Merge branch 'main' into develop
tpdownes Sep 20, 2022
8c3a9b2
Merge pull request #560 from nick-stroud/pre_existng_lustre_install
nick-stroud Sep 20, 2022
bd73407
Various fixes, change in cluster name
heyealex Sep 20, 2022
9d4712c
Break out group directory creation into function
heyealex Sep 20, 2022
29628d0
Merge pull request #564 from heyealex/allow-generic-git-sources
heyealex Sep 21, 2022
d421eb9
Merge pull request #556 from heyealex/doc/hybrid-slurm
heyealex Sep 21, 2022
15d7f02
Use fully qualified Ansible resource names
tpdownes Sep 20, 2022
4b5874f
Fix fully qualified name for Ansible resource
tpdownes Sep 21, 2022
0e4d36c
Perform regular cleanup of Filestore VPC network peerings
tpdownes Sep 22, 2022
a2c4503
Merge pull request #567 from tpdownes/speed_ansible_installation
tpdownes Sep 22, 2022
36cd5b3
Merge pull request #568 from tpdownes/fix_filestore_peering_limit
tpdownes Sep 22, 2022
31ae677
Fix filestore peering network cleanup script
tpdownes Sep 26, 2022
2538146
Remove default URLs from Spack tutorials
tpdownes Sep 26, 2022
60265f1
Avoid spurious errors in Spack log
tpdownes Sep 26, 2022
729f8d0
Merge pull request #571 from tpdownes/fix_spack_spurious_error
tpdownes Sep 26, 2022
b0ba142
Merge pull request #570 from tpdownes/fix_remove_default_urls
tpdownes Sep 26, 2022
79a27ed
Merge pull request #569 from tpdownes/fix_filestore_peering_limit_script
tpdownes Sep 26, 2022
58b3dd5
Fix Ansible module for upgrading setuptools in HTCondor autoscaler
tpdownes Sep 29, 2022
d62390a
Merge pull request #581 from tpdownes/fix_htcondor_autoscaler
tpdownes Sep 29, 2022
7518131
Update version to 1.6.0
heyealex Oct 4, 2022
8878f3f
Merge pull request #599 from heyealex/version/1.6.0
heyealex Oct 4, 2022
12 changes: 11 additions & 1 deletion Makefile
@@ -14,11 +14,21 @@ ENG = ./cmd/... ./pkg/...
TERRAFORM_FOLDERS=$(shell find ./modules ./community/modules ./tools -type f -name "*.tf" -not -path '*/\.*' -exec dirname "{}" \; | sort -u)
PACKER_FOLDERS=$(shell find ./modules ./community/modules ./tools -type f -name "*.pkr.hcl" -not -path '*/\.*' -exec dirname "{}" \; | sort -u)

ifneq (, $(shell which git))
## GIT IS PRESENT
ifneq (,$(wildcard .git))
## GIT DIRECTORY EXISTS
GIT_TAG_VERSION=$(shell git tag --points-at HEAD)
GIT_BRANCH=$(shell git branch --show-current)
GIT_COMMIT_INFO=$(shell git describe --tags --dirty --long)
endif
endif

# RULES MEANT TO BE USED DIRECTLY

ghpc: warn-go-version warn-terraform-version warn-packer-version $(shell find ./cmd ./pkg ghpc.go -type f)
$(info **************** building ghpc ************************)
go build ghpc.go
@go build -ldflags="-X 'main.gitTagVersion=$(GIT_TAG_VERSION)' -X 'main.gitBranch=$(GIT_BRANCH)' -X 'main.gitCommitInfo=$(GIT_COMMIT_INFO)'" ghpc.go
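
The `-X` linker flag in the build rule overwrites package-level string variables at link time. A minimal sketch of the receiving side, assuming (this diff does not show `ghpc.go`) that package `main` declares variables named exactly as in the `-ldflags` string; `describeBuild` is a hypothetical helper added only for illustration:

```go
package main

import "fmt"

// Assumed declarations: the linker overwrites these when the binary is built
// with flags such as -X 'main.gitTagVersion=...'. They stay empty when the
// binary is built with a plain "go build".
var (
	gitTagVersion string
	gitBranch     string
	gitCommitInfo string
)

// describeBuild is a hypothetical helper that reports what was injected.
func describeBuild(tag, branch, commit string) string {
	if commit == "" {
		return "no git metadata injected"
	}
	return fmt.Sprintf("tag=%q branch=%q commit=%q", tag, branch, commit)
}

func main() {
	fmt.Println(describeBuild(gitTagVersion, gitBranch, gitCommitInfo))
}
```

Only string variables can be set this way, which is why the Makefile passes each git value as a separate `-X` assignment.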

install-user:
$(info ******** installing ghpc in ~/bin *********************)
53 changes: 53 additions & 0 deletions README.md
@@ -204,6 +204,59 @@ In the right side, expand the Filters view and then filter by label, specifying

## Troubleshooting

### Network is unreachable (Slurm V5)

Slurm requires access to Google APIs to function. This can be achieved through one of the following methods:

1. Create a [Cloud NAT](https://cloud.google.com/nat) (preferred).
2. Set `disable_controller_public_ips: false` and
`disable_login_public_ips: false` on the controller and login nodes,
respectively.
3. Enable
[private access to Google APIs](https://cloud.google.com/vpc/docs/private-access-options).

By default the Toolkit VPC module creates an associated Cloud NAT, so this issue
typically appears when working with the pre-existing-vpc module. If no access
exists, you will see the following message when you SSH into the login node or
controller:

```text
*** Slurm setup failed! Please view log: /slurm/scripts/setup.log ***
```

> **_NOTE:_** Many different issues can produce the above
> message, so be sure to verify the issue in the logs.

To confirm the issue, SSH into the controller and run `sudo cat /slurm/scripts/setup.log`. Look for
the following log entries:

```text
google_metadata_script_runner: startup-script: ERROR: [Errno 101] Network is unreachable
google_metadata_script_runner: startup-script: OSError: [Errno 101] Network is unreachable
google_metadata_script_runner: startup-script: ERROR: Aborting setup...
google_metadata_script_runner: startup-script exit status 0
google_metadata_script_runner: Finished running startup scripts.
```
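
Checking a long `setup.log` for these entries can be scripted. A minimal sketch, assuming the log text has already been read into a string; `networkUnreachableLines` is a hypothetical helper, not part of the Toolkit:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// networkUnreachableLines returns the lines of a setup log that contain the
// "[Errno 101] Network is unreachable" error quoted above.
func networkUnreachableLines(log string) []string {
	const marker = "[Errno 101] Network is unreachable"
	var hits []string
	scanner := bufio.NewScanner(strings.NewReader(log))
	for scanner.Scan() {
		if strings.Contains(scanner.Text(), marker) {
			hits = append(hits, scanner.Text())
		}
	}
	return hits
}

func main() {
	// Sample input copied from the log excerpt above.
	sample := `google_metadata_script_runner: startup-script: ERROR: [Errno 101] Network is unreachable
google_metadata_script_runner: startup-script: ERROR: Aborting setup...`
	for _, line := range networkUnreachableLines(sample) {
		fmt.Println(line)
	}
}
```

An empty result does not rule out other setup failures; it only shows that this particular error is absent from the text that was scanned.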

You may also notice mount failure logs on the login node:

```text
INFO: Waiting for '/usr/local/etc/slurm' to be mounted...
INFO: Waiting for '/home' to be mounted...
INFO: Waiting for '/opt/apps' to be mounted...
INFO: Waiting for '/etc/munge' to be mounted...
ERROR: mount of path '/usr/local/etc/slurm' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/usr/local/etc/slurm']' returned non-zero exit status 32.
ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
ERROR: mount of path '/etc/munge' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/etc/munge']' returned non-zero exit status 32.
```

> **_NOTE:_** The above logs only indicate that something went wrong with the
> startup of the controller. Check the logs on the controller to confirm it is a
> network issue.

### Failure to Create Auto Scale Nodes (Slurm)

If your deployment succeeds but your jobs fail with the following error:
28 changes: 26 additions & 2 deletions cmd/root.go
@@ -23,8 +23,16 @@ import (
"github.com/spf13/cobra"
)

// Git references, set when building with the Makefile
var (
rootCmd = &cobra.Command{
GitTagVersion string
GitBranch string
GitCommitInfo string
)

var (
annotation = make(map[string]string)
rootCmd = &cobra.Command{
Use: "ghpc",
Short: "A blueprint and deployment engine for HPC clusters in GCP.",
Long: `gHPC provides a flexible and simple to use interface to accelerate
@@ -34,12 +42,28 @@ HPC deployments on the Google Cloud Platform.`,
log.Fatalf("cmd.Help function failed: %s", err)
}
},
Version: "v1.5.0",
Version: "v1.6.0",
Annotations: annotation,
}
)

// Execute the root command
func Execute() error {
if len(GitCommitInfo) > 0 {
if len(GitTagVersion) == 0 {
GitTagVersion = "- not built from official release"
}
if len(GitBranch) == 0 {
GitBranch = "detached HEAD"
}
annotation["version"] = GitTagVersion
annotation["branch"] = GitBranch
annotation["commitInfo"] = GitCommitInfo
rootCmd.SetVersionTemplate(`ghpc version {{index .Annotations "version"}}
Built from '{{index .Annotations "branch"}}' branch.
Commit info: {{index .Annotations "commitInfo"}}
`)
}
return rootCmd.Execute()
}
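
The version template reads its values from the command's `Annotations` map using the template function `index`. A minimal sketch of how it renders, assuming cobra evaluates the template with Go `text/template` semantics; `fakeCommand` is a hypothetical stand-in for `cobra.Command`, and `renderVersion` is an illustrative helper that repeats the fallbacks from `Execute`:

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"text/template"
)

// versionTemplate mirrors the string passed to SetVersionTemplate.
const versionTemplate = `ghpc version {{index .Annotations "version"}}
Built from '{{index .Annotations "branch"}}' branch.
Commit info: {{index .Annotations "commitInfo"}}
`

// fakeCommand is a hypothetical stand-in for cobra.Command; only the
// Annotations field matters to the template.
type fakeCommand struct {
	Annotations map[string]string
}

// renderVersion applies the same fallbacks as Execute for an empty tag or
// branch, then renders the template into a string.
func renderVersion(tag, branch, commitInfo string) (string, error) {
	if tag == "" {
		tag = "- not built from official release"
	}
	if branch == "" {
		branch = "detached HEAD"
	}
	cmd := fakeCommand{Annotations: map[string]string{
		"version":    tag,
		"branch":     branch,
		"commitInfo": commitInfo,
	}}
	tmpl, err := template.New("version").Parse(versionTemplate)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	if err := tmpl.Execute(&out, cmd); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	// "abc" is a placeholder, not real `git describe` output.
	out, err := renderVersion("", "", "abc")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(out)
}
```

Because the custom template is installed only when `GitCommitInfo` is non-empty, a binary built without the Makefile keeps cobra's default version output.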

9 changes: 2 additions & 7 deletions community/examples/AMD/README.md
@@ -75,20 +75,15 @@ remounted and that you should logout and login. Follow its instructions.
Once configuration is complete, install AOCC by running:

```shell
sudo -i bash /var/tmp/install_aocc.sh
sudo bash /var/tmp/install_aocc.sh
```

Spack will prompt you to accept the AOCC End User License Agreement by opening a
text file containing information about the license. Leave the file unmodified
and write it to disk by typing `:q` as two characters in sequence
([VI help][vihelp]).

Installation of AOCC and OpenMPI will take approximately 15 minutes. Once they
are installed, you can install additional packages such as `amdblis`:

```shell
sudo -i spack -d install -v amdblis %[email protected]
```
Installation of AOCC and OpenMPI will take approximately 15 minutes.

Configure SSH user keys for access between cluster nodes:

28 changes: 27 additions & 1 deletion community/examples/AMD/hpc-cluster-amd-slurmv5.yaml
@@ -65,20 +65,46 @@ deployment_groups:
- type: shell
source: modules/startup-script/examples/install_ansible.sh
destination: install_ansible.sh
- $(swfs.install_nfs_client_runner)
- $(swfs.mount_runner)
- $(spack.install_spack_deps_runner)
- $(spack.install_spack_runner)
- type: shell
content: "shutdown -h +1"
destination: shutdown.sh

- id: slurm_startup
source: modules/scripts/startup-script
settings:
runners:
- type: data
destination: /etc/profile.d/spack.sh
content: |
#!/bin/sh
if [ -f /sw/spack/share/spack/setup-env.sh ]; then
. /sw/spack/share/spack/setup-env.sh
fi
# the following installation of AOCC may be automated in the future
# with a clear direction to the user to read the EULA at
# https://developer.amd.com/aocc-compiler-eula/
- type: data
destination: /var/tmp/install_aocc.sh
content: |
#!/bin/bash
source /sw/spack/share/spack/setup-env.sh
spack install [email protected] +license-agreed
spack load [email protected]
spack compiler find --scope site
spack -d install -v [email protected] %[email protected] +legacylaunchers +pmi schedulers=slurm

# must restart vm to re-initiate subsequent installs
- id: spack_builder
source: modules/compute/vm-instance
use: [network1, swfs, spack-startup]
settings:
name_prefix: spack-builder
machine_type: c2d-standard-16

- id: low_cost_partition
source: community/modules/compute/schedmd-slurm-gcp-v5-partition
use:
@@ -118,6 +144,6 @@ deployment_groups:
use:
- network1
- slurm_controller
- spack-startup
- slurm_startup
settings:
machine_type: c2d-standard-4
5 changes: 0 additions & 5 deletions community/examples/cloud-batch.yaml
@@ -29,17 +29,14 @@ deployment_groups:
modules:
- id: network1
source: modules/network/pre-existing-vpc
kind: terraform

- id: appfs
source: modules/file-system/filestore
kind: terraform
use: [network1]
settings: {local_mount: /sw}

- id: hello-startup-script
source: modules/scripts/startup-script
kind: terraform
settings:
runners:
- type: shell
@@ -55,7 +52,6 @@ deployment_groups:

- id: batch-job
source: community/modules/scheduler/cloud-batch-job
kind: terraform
use: [network1, appfs, hello-startup-script]
settings:
runnable: "cat /sw/hello.txt"
@@ -66,6 +62,5 @@ deployment_groups:

- id: batch-login
source: community/modules/scheduler/cloud-batch-login-node
kind: terraform
use: [batch-job]
outputs: [instructions]
6 changes: 0 additions & 6 deletions community/examples/hpc-cluster-small-sharedvpc.yaml
@@ -43,15 +43,13 @@ deployment_groups:
modules:
- id: network1
source: modules/network/pre-existing-vpc
kind: terraform
settings:
project_id: $(vars.host_project_id)
network_name: your-shared-network
subnetwork_name: your-shared-subnetwork

- id: homefs
source: modules/file-system/filestore
kind: terraform
use: [network1]
settings:
local_mount: /home
@@ -61,7 +59,6 @@ deployment_groups:
# This debug_partition will work out of the box without requesting additional GCP quota.
- id: debug_partition
source: community/modules/compute/SchedMD-slurm-on-gcp-partition
kind: terraform
use:
- network1
- homefs
@@ -75,7 +72,6 @@ deployment_groups:
# This compute_partition is far more performant than debug_partition but may require requesting GCP quotas first.
- id: compute_partition
source: community/modules/compute/SchedMD-slurm-on-gcp-partition
kind: terraform
use:
- network1
- homefs
@@ -85,7 +81,6 @@ deployment_groups:

- id: slurm_controller
source: community/modules/scheduler/SchedMD-slurm-on-gcp-controller
kind: terraform
use:
- network1
- homefs
@@ -97,7 +92,6 @@ deployment_groups:

- id: slurm_login
source: community/modules/scheduler/SchedMD-slurm-on-gcp-login-node
kind: terraform
use:
- network1
- homefs
10 changes: 0 additions & 10 deletions community/examples/htcondor-pool.yaml
@@ -29,7 +29,6 @@ deployment_groups:
modules:
- id: network1
source: modules/network/vpc
kind: terraform
settings:
network_name: htcondor-pool
subnetwork_name: htcondor-pool-usc1
@@ -38,21 +37,17 @@ deployment_groups:

- id: htcondor_install
source: community/modules/scripts/htcondor-install
kind: terraform

- id: htcondor_services
source: community/modules/project/service-enablement
kind: terraform
use:
- htcondor_install

- id: htcondor_configure
source: community/modules/scheduler/htcondor-configure
kind: terraform

- id: htcondor_configure_central_manager
source: modules/scripts/startup-script
kind: terraform
settings:
runners:
- type: shell
@@ -63,7 +58,6 @@ deployment_groups:

- id: htcondor_cm
source: modules/compute/vm-instance
kind: terraform
use:
- network1
- htcondor_configure_central_manager
@@ -80,7 +74,6 @@ deployment_groups:

- id: htcondor_configure_execute_point
source: modules/scripts/startup-script
kind: terraform
settings:
runners:
- type: shell
@@ -91,7 +84,6 @@ deployment_groups:

- id: htcondor_execute_point
source: community/modules/compute/htcondor-execute-point
kind: terraform
use:
- network1
- htcondor_configure_execute_point
@@ -106,7 +98,6 @@ deployment_groups:

- id: htcondor_configure_access_point
source: modules/scripts/startup-script
kind: terraform
settings:
runners:
- type: shell
@@ -130,7 +121,6 @@ deployment_groups:
queue
- id: htcondor_access
source: modules/compute/vm-instance
kind: terraform
use:
- network1
- htcondor_configure_access_point
3 changes: 0 additions & 3 deletions community/examples/intel/daos-cluster.yaml
@@ -30,14 +30,12 @@ deployment_groups:
modules:
- id: network1
source: modules/network/pre-existing-vpc
kind: terraform

# This module creates a DAOS server. Server images MUST be created before running this.
# https://github.com/daos-stack/google-cloud-daos/tree/main/images
# more info: https://github.com/daos-stack/google-cloud-daos/tree/main/terraform/modules/daos_server
- id: daos-server
source: github.com/daos-stack/google-cloud-daos.git//terraform/modules/daos_server?ref=v0.2.1
kind: terraform
use: [network1]
settings:
number_of_instances: 2
@@ -48,7 +46,6 @@ deployment_groups:
# more info: https://github.com/daos-stack/google-cloud-daos/tree/main/terraform/modules/daos_client
- id: daos-client
source: github.com/daos-stack/google-cloud-daos.git//terraform/modules/daos_client?ref=v0.2.1
kind: terraform
use: [network1, daos-server]
settings:
number_of_instances: 2