Merge pull request #2135 from harshthakkar01/intel-select
Remove intel-select blueprints and references
harshthakkar01 authored Jan 17, 2024
2 parents 0b5de42 + a68d8f6 commit b5b8671
Showing 8 changed files with 8 additions and 693 deletions.
149 changes: 0 additions & 149 deletions community/examples/intel/README.md
@@ -33,155 +33,6 @@
- [Unmount the Container](#unmount-the-container)
- [Delete the DAOS/Slurm Cluster infrastructure when not in use](#delete-the-daosslurm-cluster-infrastructure-when-not-in-use)

## Intel-Optimized Slurm Cluster

This document is adapted from a [Cloud Shell tutorial][tutorial] developed to
demonstrate Intel Select Solutions within the Toolkit. It expands upon that
tutorial by building custom images that save provisioning time and improve
reliability when scaling up compute nodes.

The Google Cloud [HPC VM Image][hpcvmimage] has a built-in feature enabling it
to install a Google Cloud-tested release of Intel compilers and libraries that
are known to achieve optimal performance on Google Cloud.

[tutorial]: ../../../docs/tutorials/intel-select/intel-select.md
[hpcvmimage]: https://cloud.google.com/compute/docs/instances/create-hpc-vm

Identify a project to work in and substitute its unique ID wherever you see
`<<PROJECT_ID>>` in the instructions below.
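
The active project can also be set once for `gcloud` so that later commands
default to it (a minimal sketch; replace the placeholder with your project ID):

```shell
# Set the default project for subsequent gcloud commands (replace the placeholder).
gcloud config set project <<PROJECT_ID>>

# Confirm the active project.
gcloud config get-value project
```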

### Initial Setup for the Intel-Optimized Slurm Cluster

Before provisioning any infrastructure in this project you should follow the
Toolkit guidance to enable [APIs][apis] and establish minimum resource
[quotas][quotas]. In particular, the following APIs should be enabled:

- [file.googleapis.com](https://cloud.google.com/filestore/docs/reference/rest) (Cloud Filestore)
- [compute.googleapis.com](https://cloud.google.com/compute/docs/reference/rest/v1#service:-compute.googleapis.com) (Google Compute Engine)

[apis]: ../../../README.md#enable-gcp-apis
[quotas]: ../../../README.md#gcp-quotas
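
If these APIs are not yet enabled, they can be turned on from the command line
(a sketch assuming the `gcloud` CLI is installed and authenticated against the
target project):

```shell
# Enable the Filestore and Compute Engine APIs in the active project.
gcloud services enable file.googleapis.com compute.googleapis.com
```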

The following quota must be available in the region used by the cluster:

- Filestore: 2560GB
- C2 CPUs: 4 (login node)
- C2 CPUs: up to 6000 (fully-scaled "compute" partition)
  - This quota is not necessary at initial deployment, but will be required to
    successfully scale the partition to its maximum size
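
Regional CPU quota can be checked from the command line as well (a sketch;
adjust the region to match your deployment — Filestore capacity quota is shown
separately in the Cloud Console under IAM & Admin > Quotas):

```shell
# The default YAML output lists per-metric quota; look for the C2_CPUS entry.
gcloud compute regions describe us-central1 | grep -B 1 -A 1 "metric: C2_CPUS"
```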

### Deploy the Slurm Cluster

Use `ghpc` to create the deployment folder from the blueprint, supplying your project ID:

```text
ghpc create --vars project_id=<<PROJECT_ID>> community/examples/intel/hpc-intel-select-slurm.yaml
```

This will create a set of directories containing Terraform modules and Packer
templates. **Please ignore the printed instructions** and instead follow the
numbered steps after the layout sketch below:
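
The generated deployment folder is expected to look roughly like this (a sketch
only; exact contents may differ between Toolkit releases):

```text
hpc-intel-select/
├── primary/                  # Terraform: network and Intel software startup scripts
├── build1/
│   └── controller-image/     # Packer: custom Slurm controller image
├── build2/
│   └── compute-image/        # Packer: custom Slurm login/compute image
└── cluster/                  # Terraform: Slurm cluster
```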

1. Provision the network and startup scripts that install Intel software

```shell
terraform -chdir=hpc-intel-select/primary init
terraform -chdir=hpc-intel-select/primary validate
terraform -chdir=hpc-intel-select/primary apply
```

2. Capture the startup scripts to files that will be used by Packer to build the
images (a verification sketch follows these steps)

```shell
terraform -chdir=hpc-intel-select/primary output \
-raw startup_script_startup_controller > \
hpc-intel-select/build1/controller-image/startup_script.sh
terraform -chdir=hpc-intel-select/primary output \
-raw startup_script_startup_compute > \
hpc-intel-select/build2/compute-image/startup_script.sh
```

3. Build the custom Slurm controller image. While this step is executing, you
may begin the next step in parallel.

```shell
cd hpc-intel-select/build1/controller-image
packer init .
packer validate .
packer build -var startup_script_file=startup_script.sh .
```

4. Build the custom Slurm image for login and compute nodes

```shell
cd -
cd hpc-intel-select/build2/compute-image
packer init .
packer validate .
packer build -var startup_script_file=startup_script.sh .
```

5. Provision the Slurm cluster

```shell
cd -
terraform -chdir=hpc-intel-select/cluster init
terraform -chdir=hpc-intel-select/cluster validate
terraform -chdir=hpc-intel-select/cluster apply
```
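
As referenced in step 2, it can help to verify intermediate artifacts before
moving on (a sketch; the image listing assumes the custom images were built in
the active project):

```shell
# After step 2: fail early if either captured startup script is missing or empty.
test -s hpc-intel-select/build1/controller-image/startup_script.sh
test -s hpc-intel-select/build2/compute-image/startup_script.sh

# After steps 3-4: list custom (non-public) images produced by the Packer builds.
gcloud compute images list --no-standard-images
```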

### Connect to the login node

Once the startup script has completed and Slurm reports readiness, connect to the login node.

1. Open the following URL in a new tab.

https://console.cloud.google.com/compute

This will take you to **Compute Engine > VM instances** in the Google Cloud Console.

Ensure that you select the project in which you are provisioning the cluster.

2. Click on the **SSH** button associated with the `slurm-hpc-intel-select-login0`
instance.

This will open a separate pop-up window with a terminal into the newly created
Slurm login VM.
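
As an alternative to the console's **SSH** button, the connection can be made
with `gcloud` (a sketch; `<<ZONE>>` is a placeholder for the zone configured in
the blueprint):

```shell
# SSH to the Slurm login node from a local terminal or Cloud Shell.
gcloud compute ssh slurm-hpc-intel-select-login0 --zone=<<ZONE>>
```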

### Access the cluster and provision an example job

**The commands below should be run on the login node.**

1. Create a default SSH key to enable SSH between nodes

```shell
ssh-keygen -q -N '' -f ~/.ssh/id_rsa
cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
```

1. Submit an example job

```shell
cp /var/tmp/dgemm_job.sh .
sbatch dgemm_job.sh
```
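
Once submitted, the job can be watched with standard Slurm commands run on the
login node; because compute nodes are created on demand, the job may remain
pending while they are provisioned:

```shell
# Show queued and running jobs.
squeue

# Show partition and node state; compute nodes appear as they power up.
sinfo

# By default, sbatch writes output to slurm-<job_id>.out in the submission directory.
cat slurm-*.out
```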

### Delete the infrastructure when not in use

> **_NOTE:_** If the Slurm controller is shut down before the auto-scale nodes
> are destroyed then they will be left running.

Open your browser to the VM instances page and ensure that nodes named "compute"
have been shut down and deleted by the Slurm autoscaler. Delete the remaining
infrastructure in reverse order of creation:

```shell
terraform -chdir=hpc-intel-select/cluster destroy
terraform -chdir=hpc-intel-select/primary destroy
```
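
The console check above can also be done from the command line (a sketch;
assumes the compute node names contain "compute", matching the default
deployment naming):

```shell
# List any remaining autoscaled compute node VMs; the output should be empty.
gcloud compute instances list --filter="name~compute"
```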

## DAOS Cluster

The [pfs-daos.yaml](pfs-daos.yaml) blueprint describes an environment with
162 changes: 0 additions & 162 deletions community/examples/intel/hpc-intel-select-slurm.yaml

This file was deleted.

19 changes: 4 additions & 15 deletions docs/tutorials/README.md
@@ -5,19 +5,6 @@
Find the quickstart tutorial on
[Google Cloud docs](https://cloud.google.com/hpc-toolkit/docs/quickstarts/slurm-cluster).

## Intel Select Tutorial

Walks through deploying an HPC cluster that is based on the
[HPC virtual machine (VM) image][hpc-vm-image] and complies to the
[Intel Select Solution for Simulation and Modeling criteria][intel-select].

Click the button below to launch the Intel Select tutorial.

[![Open in Cloud Shell](https://gstatic.com/cloudssh/images/open-btn.svg)](https://shell.cloud.google.com/cloudshell/editor?cloudshell_git_repo=https%3A%2F%2Fgithub.com%2FGoogleCloudPlatform%2Fhpc-toolkit&cloudshell_open_in_editor=docs%2Ftutorials%2Fintel-select%2Fhpc-cluster-intel-select.yaml&cloudshell_tutorial=docs%2Ftutorials%2Fintel-select%2Fintel-select.md)

[hpc-vm-image]: https://cloud.google.com/compute/docs/instances/create-hpc-vm
[intel-select]: https://www.intel.com/content/www/us/en/products/solutions/select-solutions/hpc/simulation-modeling.html

## HTCondor Tutorial

Walk through deploying an HTCondor pool that supports jobs running inside Docker
@@ -27,6 +14,8 @@ Click the button below to launch the HTCondor tutorial.

[![Open in Cloud Shell](https://gstatic.com/cloudssh/images/open-btn.svg)](https://shell.cloud.google.com/cloudshell/editor?cloudshell_git_repo=https%3A%2F%2Fgithub.com%2FGoogleCloudPlatform%2Fhpc-toolkit&cloudshell_open_in_editor=community%2Fexamples%2Fhtc-htcondor.yaml&cloudshell_tutorial=docs%2Ftutorials%2Fhtcondor.md)

[hpc-vm-image]: https://cloud.google.com/compute/docs/instances/create-hpc-vm

## SC-23 Tutorial

[Blueprint](./sc23-tutorial/hcls-blueprint.yaml) used in the Supercomputing 2023 tutorial “Unlocking the potential of HPC in the Google Cloud with Open-Source Tools”
@@ -61,11 +50,11 @@ modules relate to each other.

```mermaid
graph TB
A(Virtual Private Cloud)
C(Spack Install Script)
D(Startup Scripts)
E(Compute Partition)
F(Slurm Controller)
G(Slurm Login Node)
B(Monitoring Dashboard)
C --> D
