Note
This document uses the SlurmGCP v6 version of the HCLS blueprint. If you want to use the SlurmGCP v5 version, please refer to this blueprint.
This folder captures an advanced architecture that can be used to run GROMACS with GPUs or CPUs on Google Cloud.
There are several ways to get started with the HCLS blueprint.
First, you will want to deploy the blueprint by following the Deployment Instructions.
Once deployed, you can test the cluster by running an example workload:
- Water Benchmark Example: All the inputs needed to run this example are included as part of the blueprint, making it an easy test case for running GROMACS and confirming that the cluster is working as expected.
- Lysozyme Example: This example demonstrates a real-life case of simulating the Lysozyme protein in water. It is a multi-step, GPU-enabled GROMACS simulation. This example was featured in this YouTube Video.
The blueprint includes:
- Auto-scaling Slurm cluster
- Filestore for shared NFS storage
- Input and output Google Cloud Storage bucket
- GPU accelerated remote desktop for visualization
- Software builder VM to compile molecular dynamics software
This blueprint has 4 deployment groups:
- enable_apis: Ensure that all of the needed APIs are enabled before deploying the cluster.
- setup: Set up backbone infrastructure such as networking, file systems, and monitoring.
- software_installation: Compile and install HPC applications and populate the input library.
- cluster: Deploy an auto-scaling cluster and remote desktop.
Having multiple deployment groups decouples the life cycle of some infrastructure. For example, (a) you can tear down the cluster while leaving the storage intact, and (b) you can build software before you deploy your cluster.
Warning
This tutorial uses the following billable components of Google Cloud:
- Compute Engine
- Filestore
- Cloud Storage
To avoid continued billing once the tutorial is complete, closely follow the teardown instructions. Additionally, you may want to deploy this tutorial into a new project that can be deleted when the tutorial is complete. To generate a cost estimate based on your projected usage, use the pricing calculator.
Important
Before attempting to execute the following instructions, it is important to
consider your project's quota. The hcls-blueprint.yaml
blueprint creates an
autoscaling cluster that, when fully scaled up, can deploy up to 20 a2-highgpu-1g VMs and 20 c2-standard-60 VMs.
To fully scale up this cluster, the project would require quota for (at least):
- GPU Node Group
  - 12 CPUs * 20 VMs = 240 A2 CPUs
  - 1 GPU * 20 VMs = 20 NVIDIA A100 GPUs
- Compute Node Group
  - 60 CPUs * 20 VMs = 1200 C2 CPUs
- Slurm Login VM
  - 2 N2 CPUs
- Slurm Controller VM
  - 4 C2 CPUs
Neither the Water Benchmark Example nor the Lysozyme Example requires the cluster to fully scale up. Please see:
- Water Benchmark Example Quota Requirements
- Lysozyme Example Quota Requirements
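If you want to check your project's current quota and usage from the command line before deploying, a gcloud query along these lines can help. The region, project placeholder, and output flags below are illustrative, not part of the official instructions:

```bash
# Example only: list quota metrics, limits, and current usage for one region.
# Replace <project> and the region with your own values, then look for the
# CPU and GPU metrics called out above (A2/C2/N2 CPUs, NVIDIA A100 GPUs).
gcloud compute regions describe us-central1 \
  --project=<project> \
  --flatten="quotas[]" \
  --format="table(quotas.metric,quotas.limit,quotas.usage)"
```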
- Clone the repo
git clone https://github.com/GoogleCloudPlatform/hpc-toolkit.git
cd hpc-toolkit
- Build the HPC Toolkit
make
- Generate the deployment folder after replacing <project> with the project ID.
If you are running this as a test and don't care about the files created in the cloud buckets being destroyed, it is recommended that you run:
./ghpc create examples/hcls-blueprint.yaml -w --vars project_id=<project> --vars bucket_force_delete=true
The bucket_force_delete variable makes it easier to tear down the deployment. If it is left at the default value of false, buckets that contain objects (files) will not be deleted and the ./ghpc destroy command will fail partway through. If the data stored in the buckets should be preserved, remove the --vars bucket_force_delete=true portion of the command or set it to false.
- Deploy the enable_apis group
Call the following ghpc command to deploy the hcls blueprint.
./ghpc deploy hcls-01
This will prompt you to display, apply, stop, or continue without applying the enable_apis group. Select 'apply'. This will ensure that all of the needed APIs are enabled before deploying the cluster.
[!WARNING] This ghpc command will run through 4 groups (enable_apis, setup, software_installation, and cluster) and prompt you to apply each one. If the command is cancelled or exited by accident before finishing, it can be rerun to continue deploying the blueprint.
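If you want to double-check which APIs the enable_apis group turned on, you can list the project's enabled services at any point; the project ID below is a placeholder:

```bash
# Example only: list the services currently enabled in the project.
gcloud services list --enabled --project=<project>
```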
- Deploy the setup group
The next ghpc prompt will ask you to display, apply, stop, or continue without applying the setup group. Select 'apply'. This group will create a network and file systems to be used by the cluster.
[!NOTE] At this point do not proceed with the ghpc prompt for the cluster group. Continue with the steps below before proceeding.
This step will create a storage bucket for depositing software. The bucket will have the prefix hcls-user-provided-software followed by the deployment name (e.g. hcls-01) and a random suffix, for example hcls-user-provided-software-hcls-01-34c8749a.
Here are two ways to locate the bucket name:
- At the end of the setup deployment, ghpc should output a line Outputs:. Under that there should be a line similar to gcs_bucket_path_bucket-software = "gs://hcls-user-provided-software-hcls-01-84d0b51e"; the bucket name is located within the quotes after gs://.
- On the GCP Cloud Console, navigate to Cloud Storage -> Buckets. Assuming you have not created two deployments with the same name, there should only be one bucket with a name like hcls-user-provided-software-hcls-01-34c8749a.
Copy this bucket name for the next step.
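Alternatively, if you prefer the command line, a listing like the following should surface the same bucket name; the project ID is a placeholder:

```bash
# Example only: list buckets in the project and filter for the software bucket prefix.
gcloud storage ls --project=<project> | grep hcls-user-provided-software
```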
- Upload VMD tarball
VMD is visualization software used by the remote desktop. While the software is free, the user must register before downloading it.
To download the software, complete the registration here and then download the tarball. The blueprint has been tested with the LINUX_64 OpenGL, CUDA, OptiX, OSPRay version (vmd-1.9.3.bin.LINUXAMD64-CUDA8-OptiX4-OSPRay111p1.opengl.tar.gz) but should work with any compatible 1.9.x version.
Next, upload the tar.gz file to the bucket created during the deployment of setup, whose name you copied at the end of the last step. The virtual desktop will automatically look for this file when booting up. To do this using the Google Cloud UI:
- Navigate to the Cloud Storage page.
- Click on the bucket with the name provided for bucket_name_software.
- Click on UPLOAD FILES.
- Select the tar.gz file for VMD.
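If you would rather upload from a terminal than from the Cloud Console, a copy along these lines works as well; the bucket name below is only an example, so use the one you copied in the previous step:

```bash
# Example only: upload the VMD tarball to the software bucket from the command line.
gcloud storage cp vmd-1.9.3.bin.LINUXAMD64-CUDA8-OptiX4-OSPRay111p1.opengl.tar.gz \
  gs://hcls-user-provided-software-hcls-01-34c8749a/
```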
- Deploy the software_installation group
Once the file from the prior step has been completely uploaded, you can return to the ghpc command, which will ask you to display, apply, stop, or continue without applying the software_installation group. Select 'apply'. This group will deploy a builder VM that will build GROMACS and save the compiled application on the apps Filestore.
This will take several hours to run. After the software installation is complete, the builder VM will automatically shut itself down, which lets you monitor the status of the builder VM to know when the installation has finished.
You can check the serial port 1 logs and the Spack logs (/var/log/spack.log) to check status. If the builder VM never shuts down, it may be a sign that something went wrong with the software installation.
The builder VM can be shut down or deleted once the software installation has completed successfully.
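One way to keep an eye on the builder from your workstation is to poll its status and serial console output with gcloud. The name filter, VM name, zone, and project below are assumptions; adjust them to match your deployment:

```bash
# Example only: check whether the spack-builder VM is still running
# (it shuts itself down once the install finishes).
gcloud compute instances list --project=<project> --filter="name~spack-builder"

# Example only: read the serial port 1 output to follow installation progress.
gcloud compute instances get-serial-port-output <spack-builder-vm-name> \
  --zone=<zone> --project=<project>
```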
- Deploy the cluster group
The next ghpc prompt will ask you to display, apply, stop, or continue without applying the cluster group. Select 'apply'. This deployment group contains the Slurm cluster and the Chrome remote desktop visualization node.
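Once this group has been applied, an optional sanity check (not part of the official instructions) is to SSH into the Slurm login node, as described in the Water Benchmark Example below, and confirm that Slurm is responding:

```bash
# Example only: run on the Slurm login node; you should see the compute and gpu
# partitions listed once the cluster is up.
sinfo
```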
- Set up Chrome Remote Desktop
- Follow the instructions for setting up the Remote Desktop.
Note
If you created a new project for this tutorial, the easiest way to eliminate billing is to delete the project.
When you would like to tear down the deployment, each stage must be destroyed, with the exception of the enable_apis stage. Since the software_installation and cluster groups depend on the network deployed in the setup stage, they must be destroyed first. You can use the following command to destroy the deployment.
Warning
If you do not destroy all three deployment groups (setup, software_installation, and cluster), there may be continued associated costs.
./ghpc destroy hcls-01 --auto-approve
Note
If you did not create the deployment with bucket_force_delete set to true, you may have to clean out items added to the Cloud Storage buckets before Terraform will be able to destroy them. This can be done on the GCP Cloud Console.
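If ./ghpc destroy does stop on a non-empty bucket, one way to clear it from the command line is shown below. The bucket name is the example used earlier in this document, so double-check it before deleting anything:

```bash
# Example only: delete every object in the software bucket so Terraform can remove it.
gcloud storage rm "gs://hcls-user-provided-software-hcls-01-34c8749a/**"
```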
As part of deployment, the GROMACS water benchmark has been placed in the
/data_input
Cloud Storage bucket. Additionally, two sbatch Slurm submission scripts have been placed in the /apps/gromacs directory; one uses CPUs and the other uses GPUs.
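As an optional check once you are on the Slurm login node (see the steps below), you can confirm that these paths exist before submitting anything:

```bash
# Example only: confirm the benchmark inputs and submission scripts are in place.
ls /data_input
ls /apps/gromacs
```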
Note
Make sure that you have followed all of the deployment instructions before running this example.
The Water Benchmark Example only deploys one computational VM from the blueprint, so you will only need quota for either:
- GPU: 12 A2 CPUs and 1 NVIDIA A100 GPU
- CPU: 60 C2 CPUs
Note that these quotas are in addition to the quota requirements for the Slurm
login node (2x N2 CPUs) and the Slurm controller VM (4x C2 CPUs). The
spack-builder
VM should have completed and stopped, freeing its CPU quota
usage, before the computational VMs are deployed.
- SSH into the Slurm login node
Go to the VM instances page and you should see a VM with login in the name. SSH into this VM by clicking the SSH button or by any other means.
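If you prefer the command line over the console's SSH button, gcloud can open the same session. The VM name, zone, and project below are placeholders, since the exact login node name depends on your deployment:

```bash
# Example only: SSH to the Slurm login node from your workstation.
gcloud compute ssh <login-vm-name> --zone=<zone> --project=<project>
```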
- Create a submission directory
mkdir water_run && cd water_run
- Submit the GROMACS job
There are two example sbatch scripts which have been populated at:
/apps/gromacs/submit_gromacs_water_cpu.sh
/apps/gromacs/submit_gromacs_water_gpu.sh
The first of these runs on the compute partition, which uses CPUs on a c2-standard-60 machine. The second targets the gpu partition; it runs on an a2-highgpu-1g machine and uses an NVIDIA A100 for GPU acceleration.
The example below runs the GPU version of the job. You can switch out the path of the script to try the CPU version (a CPU submission example is also shown after the GPU command below).
Submit the sbatch script with the following command:
sbatch /apps/gromacs/submit_gromacs_water_gpu.sh
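For reference, the CPU variant of the same benchmark uses the other script listed above and is submitted the same way:

```bash
# CPU version of the water benchmark; runs on the compute (c2-standard-60) partition.
sbatch /apps/gromacs/submit_gromacs_water_cpu.sh
```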
- Monitor the job
Use the following command to see the status of the job:
squeue
The job state (ST) will show CF while the job is being configured. Once the state switches to R, the job is running.
If you refresh the VM instances page you will see an a2-highgpu-1g machine that has been auto-scaled up to run this job. It will have a name like hcls01-gpu-ghpc-0.
Once the job is in the running state you can track progress with the following command:
tail -f slurm-*.out
When the job has finished, the end of the slurm-*.out file will contain performance metrics such as ns/day.
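To pull just the performance summary out of the output file once the job completes, a grep along these lines usually works; GROMACS prints a line beginning with "Performance:" that reports ns/day:

```bash
# Example only: show the GROMACS performance summary (ns/day) from the Slurm output.
grep -B 1 "Performance:" slurm-*.out
```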