From 3d5958f12b866385972d5c48ffca3ca5b8c2a8fd Mon Sep 17 00:00:00 2001
From: Mark Olson <115657904+mark-olson@users.noreply.github.com>
Date: Fri, 19 Jan 2024 08:07:54 -0800
Subject: [PATCH 1/4] Update DAOS blueprints to use google-cloud-daos v0.5.0, slurm v6

[DAOSGCP-182](https://daosio.atlassian.net/browse/DAOSGCP-182)

- Bump version of DAOS modules to v0.5.0 which install DAOS v2.4
- Modify community/examples/intel/hpc-slurm-daos.yaml to use Slurm v6 modules
- Add temporary fix to community/examples/intel/hpc-slurm-daos.yaml to work around issue with missing lustre-client 8.8 repo
- Update community/examples/intel/README.md to account for changes in DAOS v2.4

Signed-off-by: Mark Olson <115657904+mark-olson@users.noreply.github.com>
---
 community/examples/intel/README.md | 145 +++++++++--------
 community/examples/intel/hpc-slurm-daos.yaml | 158 ++++++++++++-------
 community/examples/intel/pfs-daos.yaml | 44 +++---
 3 files changed, 197 insertions(+), 150 deletions(-)

diff --git a/community/examples/intel/README.md b/community/examples/intel/README.md
index 439921e31a..8b0f4072d8 100644
--- a/community/examples/intel/README.md
+++ b/community/examples/intel/README.md
@@ -4,12 +4,6 @@
 - [Intel Solutions for the HPC Toolkit](#intel-solutions-for-the-hpc-toolkit)
- - [Intel-Optimized Slurm Cluster](#intel-optimized-slurm-cluster)
- - [Initial Setup for the Intel-Optimized Slurm Cluster](#initial-setup-for-the-intel-optimized-slurm-cluster)
- - [Deploy the Slurm Cluster](#deploy-the-slurm-cluster)
- - [Connect to the login node](#connect-to-the-login-node)
- - [Access the cluster and provision an example job](#access-the-cluster-and-provision-an-example-job)
- - [Delete the infrastructure when not in use](#delete-the-infrastructure-when-not-in-use)
 - [DAOS Cluster](#daos-cluster)
 - [Initial Setup for DAOS Cluster](#initial-setup-for-daos-cluster)
 - [Deploy the DAOS Cluster](#deploy-the-daos-cluster)
@@ -17,7 +11,7 @@
 - [Verify the DAOS storage system](#verify-the-daos-storage-system)
 - [Create a DAOS Pool and Container](#create-a-daos-pool-and-container)
 - [About the DAOS Command Line Tools](#about-the-daos-command-line-tools)
- - [Determine Free Space](#determine-free-space)
+ - [View Free Space](#view-free-space)
 - [Create a Pool](#create-a-pool)
 - [Create a Container](#create-a-container)
 - [Mount the DAOS Container](#mount-the-daos-container)
@@ -47,16 +41,22 @@ for general information on building custom images using the Toolkit.
 Identify a project to work in and substitute its unique id wherever you see `<>` in the instructions below.
+[google-cloud-daos]: https://github.com/daos-stack/google-cloud-daos
+[pre-deployment_guide]: https://github.com/daos-stack/google-cloud-daos/blob/main/docs/pre-deployment_guide.md
+[DAOS Yum Repository]: https://packages.daos.io
+
 ### Initial Setup for DAOS Cluster
 Before provisioning the DAOS cluster you must follow the steps listed in the [Google Cloud DAOS Pre-deployment Guide][pre-deployment_guide].
 Skip the "Build DAOS Images" step at the end of the [Pre-deployment Guide][pre-deployment_guide]. The [pfs-daos.yaml](pfs-daos.yaml) blueprint will build the images as part of the deployment.
-The Pre-deployment Guide provides instructions for enabling service accounts, APIs, establishing minimum resource quotas and other necessary steps to prepare your project.
- -[google-cloud-daos]: https://github.com/daos-stack/google-cloud-daos -[pre-deployment_guide]: https://github.com/daos-stack/google-cloud-daos/blob/main/docs/pre-deployment_guide.md +The Pre-deployment Guide provides instructions for: +- installing the Google Cloud CLI +- enabling service accounts +- enabling APIs +- establishing minimum resource quotas +- creating a Cloud NAT to allow instances without public IPs to access the [DAOS Yum Repository] repository. ### Deploy the DAOS Cluster @@ -98,7 +98,7 @@ ghpc deploy pfs-daos --auto-approve The `community/examples/intel/pfs-daos.yaml` blueprint does not contain configuration for DAOS pools and containers. Therefore, pools and containers will need to be created manually. -Before pools and containers can be created the storage system must be formatted. Formatting the storage is done automatically by the startup script that runs on the *daos-server-0001* instance. The startup script will run the [dmg storage format](https://docs.daos.io/v2.2/admin/deployment/?h=dmg+storage#storage-formatting) command. It may take a few minutes for all daos server instances to join. +Before pools and containers can be created the storage system must be formatted. Formatting the storage is done automatically by the startup script that runs on the *daos-server-0001* instance. The startup script will run the [dmg storage format](https://docs.daos.io/v2.4/admin/deployment/?h=dmg+storage#storage-formatting) command. It may take a few minutes for all daos server instances to join. Verify that the storage system has been formatted and that the daos-server instances have joined. @@ -123,35 +123,24 @@ Both daos-server instances should show a state of *Joined*. #### About the DAOS Command Line Tools -The DAOS Management tool `dmg` is used by System Administrators to manage the DAOS storage [system](https://docs.daos.io/v2.2/overview/architecture/#daos-system) and DAOS [pools](https://docs.daos.io/v2.2/overview/storage/#daos-pool). Therefore, `sudo` must be used when running `dmg`. +The DAOS Management tool `dmg` is used by System Administrators to manage the DAOS storage [system](https://docs.daos.io/v2.4/overview/architecture/#daos-system) and DAOS [pools](https://docs.daos.io/v2.4/overview/storage/#daos-pool). Therefore, `sudo` must be used when running `dmg`. -The DAOS CLI `daos` is used by both users and System Administrators to create and manage [containers](https://docs.daos.io/v2.2/overview/storage/#daos-container). It is not necessary to use `sudo` with the `daos` command. +The DAOS CLI `daos` is used by both users and System Administrators to create and manage [containers](https://docs.daos.io/v2.4/overview/storage/#daos-container). It is not necessary to use `sudo` with the `daos` command. -#### Determine Free Space +#### View Free Space -Determine how much free space is available. +View how much free space is available. ```bash sudo dmg storage query usage ``` -The result will look similar to - -```text -Hosts SCM-Total SCM-Free SCM-Used NVMe-Total NVMe-Free NVMe-Used ------ --------- -------- -------- ---------- --------- --------- -daos-server-0001 215 GB 215 GB 0 % 6.4 TB 6.4 TB 0 % -daos-server-0002 215 GB 215 GB 0 % 6.4 TB 6.4 TB 0 % -``` - -In the example output above we see that there is a total of 12.8TB NVME-Free. - #### Create a Pool -Create a single pool owned by root which uses all available free space. +Create a single pool owned by root which uses 100% of the available free space. 
```bash -sudo dmg pool create -z 12.8TB -t 3 -u root --label=pool1 +sudo dmg pool create --size=100% --user=root pool1 ``` Set ACLs to allow any user to create a container in *pool1*. @@ -160,7 +149,7 @@ Set ACLs to allow any user to create a container in *pool1*. sudo dmg pool update-acl -e A::EVERYONE@:rcta pool1 ``` -See the [Pool Operations](https://docs.daos.io/v2.2/admin/pool_operations) section of the of the DAOS Administration Guide for more information about creating pools. +See the [Pool Operations](https://docs.daos.io/v2.4/admin/pool_operations) section of the of the DAOS Administration Guide for more information about creating pools. #### Create a Container @@ -170,24 +159,18 @@ and how it will be used. The ACLs will need to be set properly to allow users an For the purpose of this demo create the container without specifying ACLs. The container will be owned by your user account and you will have full access to the container. ```bash -daos cont create pool1 \ - --label cont1 \ - --type POSIX \ - --properties rf:0 +daos container create --type=POSIX --properties=rf:0 pool1 cont1 ``` -See the [Container Management](https://docs.daos.io/v2.2/user/container) section of the of the DAOS User Guide for more information about creating containers. +See the [Container Management](https://docs.daos.io/v2.4/user/container) section of the of the DAOS User Guide for more information about creating containers. #### Mount the DAOS Container Mount the container with dfuse (DAOS Fuse) ```bash -mkdir -p ${HOME}/daos/cont1 -dfuse --singlethread \ - --pool=pool1 \ - --container=cont1 \ - --mountpoint=${HOME}/daos/cont1 +mkdir -p "${HOME}/daos/cont1" +dfuse --singlethread --pool=pool1 --container=cont1 --mountpoint="${HOME}/daos/cont1" ``` Verify that the container is mounted @@ -207,27 +190,47 @@ time LD_PRELOAD=/usr/lib64/libioil.so \ dd if=/dev/zero of="${HOME}/daos/cont1/test20GiB.img" iflag=fullblock bs=1G count=20 ``` -See the [File System](https://docs.daos.io/v2.2/user/filesystem/) section of the DAOS User Guide for more information about DFuse. +**Known Issue:** -### Unmount the DAOS Container +When you run `ls -lh "${HOME}/daos/cont1"` you may see that the `test20GiB.img` file shows a size of 0 bytes. -The container will need to by unmounted before you log out. If this is not done it can leave open file handles and prevent the container from being mounted when you log in again. +If you unmount the container and mount it again, the file size will show as 20G. ```bash -fusermount3 -u ${HOME}/daos/cont1 +fusermount3 -u "${HOME}/daos/cont1" +dfuse --singlethread --pool=pool1 --container=cont1 --mountpoint="${HOME}/daos/cont1" +ls -lh "${HOME}/daos/cont1" +``` + +A work-around for this issue to disable caching when mounting the container. + +``` +dfuse --singlethread --disable-caching --pool=pool1 --container=cont1 --mountpoint="${HOME}/daos/cont1" ``` +See the [File System](https://docs.daos.io/v2.4/user/filesystem/) section of the DAOS User Guide for more information about DFuse. + +### Unmount the DAOS Container + +The container will need to by unmounted before you log out. If this is not done it can leave open file handles and prevent the container from being mounted when you log in again. + Verify that the container is unmounted ```bash df -h -t fuse.daos ``` -See the [DFuse (DAOS FUSE)](https://docs.daos.io/v2.2/user/filesystem/?h=dfuse#dfuse-daos-fuse) section of the DAOS User Guide for more information about mounting POSIX containers. +Logout of the DAOS client instance. 
+ +```bash +logout +``` + +See the [DFuse (DAOS FUSE)](https://docs.daos.io/v2.4/user/filesystem/?h=dfuse#dfuse-daos-fuse) section of the DAOS User Guide for more information about mounting POSIX containers. ### Delete the DAOS infrastructure when not in use -> **_NOTE:_** All the DAOS data will be permanently lost after cluster deletion. +> **_NOTE:_** Data stored in the DAOS container will be permanently lost after cluster deletion. Delete the remaining infrastructure @@ -237,21 +240,21 @@ ghpc destroy pfs-daos --auto-approve ## DAOS Server with Slurm cluster -The [hpc-slurm-daos.yaml](hpc-slurm-daos.yaml) blueprint describes an environment with a Slurm cluster and four DAOS server instances. The compute nodes are configured as DAOS clients and have the ability to use the DAOS filesystem on the DAOS server instances. +The [hpc-slurm-daos.yaml](hpc-slurm-daos.yaml) blueprint can be used to deploy a Slurm cluster and four DAOS server instances. The Slurm compute instances are configured as DAOS clients. The blueprint uses modules from - [google-cloud-daos][google-cloud-daos] -- [community/modules/scheduler/SchedMD-slurm-on-gcp-controller][SchedMD-slurm-on-gcp-controller] -- [community/modules/scheduler/SchedMD-slurm-on-gcp-login-node][SchedMD-slurm-on-gcp-login-node] -- [community/modules/compute/SchedMD-slurm-on-gcp-partition][SchedMD-slurm-on-gcp-partition] +- [community/modules/compute/schedmd-slurm-gcp-v6-nodeset][schedmd-slurm-gcp-v6-nodeset] +- [community/modules/compute/schedmd-slurm-gcp-v6-partition][schedmd-slurm-gcp-v6-partition] +- [community/modules/scheduler/schedmd-slurm-gcp-v6-login][schedmd-slurm-gcp-v6-login] +- [community/modules/scheduler/schedmd-slurm-gcp-v6-controller][schedmd-slurm-gcp-v6-controller] The blueprint also uses a Packer template from the [Google Cloud DAOS][google-cloud-daos] repository. Please review the [introduction to image building](../../../docs/image-building.md) for general information on building custom images using the Toolkit. -Identify a project to work in and substitute its unique id wherever you see -`<>` in the instructions below. +Substitute your project ID wherever you see `<>` in the instructions below. ### Initial Setup for the DAOS/Slurm cluster @@ -259,16 +262,16 @@ Before provisioning the DAOS cluster you must follow the steps listed in the [Go Skip the "Build DAOS Images" step at the end of the [Pre-deployment Guide][pre-deployment_guide]. The [hpc-slurm-daos.yaml](hpc-slurm-daos.yaml) blueprint will build the DAOS server image as part of the deployment. -The Pre-deployment Guide provides instructions for enabling service accounts, APIs, establishing minimum resource quotas and other necessary steps to prepare your project for DAOS server deployment. +The [Pre-deployment Guide][pre-deployment_guide] provides instructions for enabling service accounts, APIs, establishing minimum resource quotas and other necessary steps to prepare your project for DAOS server deployment. 
[google-cloud-daos]: https://github.com/daos-stack/google-cloud-daos [pre-deployment_guide]: https://github.com/daos-stack/google-cloud-daos/blob/main/docs/pre-deployment_guide.md - [packer-template]: https://github.com/daos-stack/google-cloud-daos/blob/main/images/daos.pkr.hcl [apis]: ../../../README.md#enable-gcp-apis -[SchedMD-slurm-on-gcp-controller]: ../../modules/scheduler/SchedMD-slurm-on-gcp-controller -[SchedMD-slurm-on-gcp-login-node]: ../../modules/scheduler/SchedMD-slurm-on-gcp-login-node -[SchedMD-slurm-on-gcp-partition]: ../../modules/compute/SchedMD-slurm-on-gcp-partition +[schedmd-slurm-gcp-v6-nodeset]: ../../modules/compute/schedmd-slurm-gcp-v6-nodeset +[schedmd-slurm-gcp-v6-partition]: ../../modules/compute/schedmd-slurm-gcp-v6-partition +[schedmd-slurm-gcp-v6-controller]: ../../modules/scheduler/schedmd-slurm-gcp-v6-controller +[schedmd-slurm-gcp-v6-login]: ../../modules/scheduler/schedmd-slurm-gcp-v6-login Follow the Toolkit guidance to enable [APIs][apis] and establish minimum resource [quotas][quotas] for Slurm. @@ -301,7 +304,7 @@ The `--backend-config` option is not required but recommended. It will save the Follow `ghpc` instructions to deploy the environment ```text -ghpc deploy daos-slurm --auto-approve +ghpc deploy hpc-slurm-daos --auto-approve ``` [backend]: ../../../examples/README.md#optional-setting-up-a-remote-terraform-state @@ -319,7 +322,7 @@ Once the startup script has completed and Slurm reports readiness, connect to th Select the project in which the cluster will be provisionsd. -2. Click on the `SSH` button associated with the `slurm-daos-slurm-login0` +2. Click on the `SSH` button associated with the `hpcslurmda-login-login-001` instance. This will open a separate pop up window with a terminal into our newly created @@ -334,10 +337,12 @@ You will need to create your own DAOS container in the pool that can be used by While logged into the login node create a container named `cont1` in the `pool1` pool: ```bash -daos cont create --type=POSIX --properties=rf:0 --label=cont1 pool1 +daos cont create --type=POSIX --properties=rf:0 pool1 cont1 ``` -Since the `cont1` container is owned by your account, your Slurm jobs will need to run as your user account in order to access the container. +NOTE: If you encounter an error `daos: command not found`, it's likely that the startup scripts have not finished running yet. Wait a few minutes and try again. + +Since the `cont1` container is owned by your account, your Slurm jobs will need to run as your user account to access the container. Create a mount point for the container and mount it with dfuse (DAOS Fuse) @@ -389,6 +394,7 @@ echo "Job ${SLURM_JOB_ID} running on ${JOB_HOSTNAME}" | tee "${MOUNT_DIR}/${TIME echo "${JOB_HOSTNAME} : Unmounting dfuse" fusermount3 -u "${MOUNT_DIR}" + ``` Run the `daos_job.sh` script in an interactive Slurm job on 4 nodes @@ -426,21 +432,20 @@ Verify that the container is unmounted df -h -t fuse.daos ``` -See the [DFuse (DAOS FUSE)](https://docs.daos.io/v2.2/user/filesystem/?h=dfuse#dfuse-daos-fuse) section of the DAOS User Guide for more information about mounting POSIX containers. +See the [DFuse (DAOS FUSE)](https://docs.daos.io/v2.4/user/filesystem/?h=dfuse#dfuse-daos-fuse) section of the DAOS User Guide for more information about mounting POSIX containers. ### Delete the DAOS/Slurm Cluster infrastructure when not in use -> **_NOTE:_** All the DAOS data will be permanently lost after cluster deletion. 
+> **_NOTE:_** All data on the DAOS file system will be permanently lost after cluster deletion. - +> **_NOTE:_** If the Slurm controller is shut down before the auto-scale instances +> are destroyed those instances will be left running. -> **_NOTE:_** If the Slurm controller is shut down before the auto-scale nodes -> are destroyed then they will be left running. +Open your browser to the VM instances page and ensure that instances named "compute" +have been shutdown and deleted by the Slurm autoscaler. -Open your browser to the VM instances page and ensure that nodes named "compute" -have been shutdown and deleted by the Slurm autoscaler. Delete the remaining -infrastructure with `terraform`: +Delete the remaining infrastructure: ```shell -ghpc destroy daos-slurm --auto-approve +ghpc destroy hpc-slurm-daos --auto-approve ``` diff --git a/community/examples/intel/hpc-slurm-daos.yaml b/community/examples/intel/hpc-slurm-daos.yaml index cd79bdc203..acc99c9050 100644 --- a/community/examples/intel/hpc-slurm-daos.yaml +++ b/community/examples/intel/hpc-slurm-daos.yaml @@ -1,4 +1,4 @@ -# Copyright 2022 Google LLC +# Copyright 2024 Google LLC # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -18,38 +18,42 @@ blueprint_name: hpc-slurm-daos vars: project_id: ## Set GCP Project ID Here ## - deployment_name: daos-slurm + deployment_name: hpc-slurm-daos region: us-central1 zone: us-central1-c - server_image_family: daos-server-hpc-rocky-8 - -# Documentation for each of the modules used below can be found at -# https://github.com/GoogleCloudPlatform/hpc-toolkit/blob/main/modules/README.md + daos_server_image_family: daos-server-hpc-rocky-8 + daos_version: "2.4" + tags: [] # Note: this blueprint assumes the existence of a default global network and # subnetwork in the region chosen above +validators: +- validator: test_module_not_used + inputs: {} + skip: true + deployment_groups: - group: primary modules: - id: network1 - source: modules/network/pre-existing-vpc + source: modules/network/vpc - id: homefs source: modules/file-system/filestore use: [network1] settings: - local_mount: "/home" + local_mount: /home - group: daos-server-image modules: - # more info: https://github.com/daos-stack/google-cloud-daos/tree/v0.4.1/images + # more info: https://github.com/daos-stack/google-cloud-daos/tree/main/images - id: daos-server-image - source: github.com/daos-stack/google-cloud-daos//images?ref=v0.4.1&depth=1 + source: "github.com/daos-stack/google-cloud-daos//images?ref=v0.5.0&depth=1" kind: packer settings: - daos_version: 2.2.0 - daos_repo_base_url: https://packages.daos.io + daos_version: $(vars.daos_version) + daos_repo_base_url: https://packages.daos.io/ daos_packages_repo_file: EL8/packages/x86_64/daos_packages.repo use_iap: true enable_oslogin: false @@ -63,26 +67,25 @@ deployment_groups: use_internal_ip: true omit_external_ip: false daos_install_type: server - image_family: $(vars.server_image_family) + image_family: $(vars.daos_server_image_family) - group: cluster modules: # more info: https://github.com/daos-stack/google-cloud-daos/tree/main/terraform/modules/daos_server - id: daos - source: github.com/daos-stack/google-cloud-daos//terraform/modules/daos_server?ref=v0.4.1&depth=1 + source: "github.com/daos-stack/google-cloud-daos//terraform/modules/daos_server?ref=v0.5.0&depth=1" use: [network1] settings: labels: {ghpc_role: file-system} - # The default DAOS settings are optimized for TCO - # The following 
will tune this system for best perf machine_type: "n2-standard-16" - os_family: $(vars.server_image_family) + os_family: $(vars.daos_server_image_family) daos_disk_count: 4 - daos_scm_size: 45 + tags: $(vars.tags) pools: - name: "pool1" - size: "6.4TB" - tier_ratio: 3 + size: "100%" + # Do not set value for scm_size when size=100% + daos_scm_size: user: "root@" group: "root@" acls: @@ -98,67 +101,102 @@ deployment_groups: settings: runners: - type: shell - content: $(daos.daos_client_install_script) - destination: /tmp/daos_client_install.sh + destination: remove_lustre_client_repo.sh + content: | + #!/bin/bash + rm -f /etc/yum.repos.d/lustre-client.repo + dnf clean all --verbose + rm -rf /var/cache/dnf/* + dnf makecache - type: data content: $(daos.daos_agent_yml) destination: /etc/daos/daos_agent.yml - type: data content: $(daos.daos_control_yml) destination: /etc/daos/daos_control.yml + - type: shell + content: $(daos.daos_client_install_script) + destination: /tmp/daos_client_install.sh - type: shell content: $(daos.daos_client_config_script) - destination: /var/daos/daos_client_config.sh + destination: /tmp/daos_client_config.sh + + - id: debug_nodeset + source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset + use: [network1] + settings: + name: ns1 + node_count_dynamic_max: 4 + machine_type: n2-standard-2 + enable_placement: false # the default is: true + service_account: + email: null + scopes: + - "https://www.googleapis.com/auth/monitoring.write" + - "https://www.googleapis.com/auth/logging.write" + - "https://www.googleapis.com/auth/devstorage.read_only" + - "https://www.googleapis.com/auth/cloud-platform" - ## This debug_partition will work out of the box without requesting additional GCP quota. - id: debug_partition - source: community/modules/compute/SchedMD-slurm-on-gcp-partition - use: - - network1 - - homefs + source: community/modules/compute/schedmd-slurm-gcp-v6-partition + use: [debug_nodeset, homefs] settings: partition_name: debug - max_node_count: 4 - enable_placement: false - machine_type: n2-standard-2 + exclusive: false # allows nodes to stay up after jobs are done + is_default: true - # This compute_partition is far more performant than debug_partition but may require requesting GCP quotas first. 
- - id: compute_partition - source: community/modules/compute/SchedMD-slurm-on-gcp-partition - use: - - network1 - - homefs + - id: compute_nodeset + source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset + use: [network1] settings: - partition_name: compute - max_node_count: 20 + name: ns2 + node_count_dynamic_max: 20 bandwidth_tier: gvnic_enabled + service_account: + email: null + scopes: + - "https://www.googleapis.com/auth/monitoring.write" + - "https://www.googleapis.com/auth/logging.write" + - "https://www.googleapis.com/auth/devstorage.read_only" + - "https://www.googleapis.com/auth/cloud-platform" - - id: slurm_controller - source: community/modules/scheduler/SchedMD-slurm-on-gcp-controller - use: - - network1 - - homefs - - debug_partition # debug partition will be default as it is listed first - - compute_partition - - daos-client-script + - id: compute_partition + source: community/modules/compute/schedmd-slurm-gcp-v6-partition + use: [compute_nodeset, homefs] settings: - login_node_count: 1 - compute_node_scopes: - - "https://www.googleapis.com/auth/monitoring.write" - - "https://www.googleapis.com/auth/logging.write" - - "https://www.googleapis.com/auth/devstorage.read_only" - - "https://www.googleapis.com/auth/cloud-platform" + partition_name: compute - id: slurm_login - source: community/modules/scheduler/SchedMD-slurm-on-gcp-login-node + source: community/modules/scheduler/schedmd-slurm-gcp-v6-login + use: [network1] + settings: + name_prefix: login + machine_type: n2-standard-4 + disable_login_public_ips: false + tags: $(vars.tags) + service_account: + email: null + scopes: + - "https://www.googleapis.com/auth/monitoring.write" + - "https://www.googleapis.com/auth/logging.write" + - "https://www.googleapis.com/auth/devstorage.read_only" + - "https://www.googleapis.com/auth/cloud-platform" + + - id: slurm_controller + source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller use: - network1 + - debug_partition + - compute_partition + - slurm_login - homefs - - slurm_controller - daos-client-script settings: - login_scopes: - - "https://www.googleapis.com/auth/monitoring.write" - - "https://www.googleapis.com/auth/logging.write" - - "https://www.googleapis.com/auth/devstorage.read_only" - - "https://www.googleapis.com/auth/cloud-platform" + disable_controller_public_ips: false + compute_startup_script: $(daos-client-script.startup_script) + controller_startup_script: $(daos-client-script.startup_script) + login_startup_script: $(daos-client-script.startup_script) + compute_startup_scripts_timeout: 1000 + controller_startup_scripts_timeout: 1000 + login_startup_scripts_timeout: 1000 + tags: $(vars.tags) diff --git a/community/examples/intel/pfs-daos.yaml b/community/examples/intel/pfs-daos.yaml index 648aba9403..3abf5c9778 100644 --- a/community/examples/intel/pfs-daos.yaml +++ b/community/examples/intel/pfs-daos.yaml @@ -1,4 +1,4 @@ -# Copyright 2022 Google LLC +# Copyright 2024 Google LLC # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
@@ -21,11 +21,10 @@ vars: deployment_name: pfs-daos region: us-central1 zone: us-central1-c - server_image_family: daos-server-hpc-rocky-8 - client_image_family: daos-client-hpc-rocky-8 - -# Documentation for each of the modules used below can be found at -# https://github.com/GoogleCloudPlatform/hpc-toolkit/blob/main/modules/README.md + daos_server_image_family: daos-server-hpc-rocky-8 + daos_client_image_family: daos-client-hpc-rocky-8 + daos_version: "2.4" + tags: [] # Note: this blueprint assumes the existence of a default global network and # subnetwork in the region chosen above @@ -38,12 +37,12 @@ deployment_groups: - group: daos-server-image modules: - # more info: https://github.com/daos-stack/google-cloud-daos/tree/v0.4.1/images + # more info: https://github.com/daos-stack/google-cloud-daos/tree/main/images - id: daos-server-image - source: github.com/daos-stack/google-cloud-daos//images?ref=v0.4.1&depth=1 + source: "github.com/daos-stack/google-cloud-daos//images?ref=v0.5.0&depth=1" kind: packer settings: - daos_version: 2.2.0 + daos_version: $(vars.daos_version) daos_repo_base_url: https://packages.daos.io daos_packages_repo_file: EL8/packages/x86_64/daos_packages.repo use_iap: true @@ -58,16 +57,16 @@ deployment_groups: use_internal_ip: true omit_external_ip: false daos_install_type: server - image_family: $(vars.server_image_family) + image_family: $(vars.daos_server_image_family) - group: daos-client-image modules: - # more info: https://github.com/daos-stack/google-cloud-daos/tree/v0.4.1/images + # more info: https://github.com/daos-stack/google-cloud-daos/tree/v0.5.0/images - id: daos-client-image - source: github.com/daos-stack/google-cloud-daos//images?ref=v0.4.1&depth=1 + source: "github.com/daos-stack/google-cloud-daos//images?ref=v0.5.0&depth=1" kind: packer settings: - daos_version: 2.2.0 + daos_version: $(vars.daos_version) daos_repo_base_url: https://packages.daos.io daos_packages_repo_file: EL8/packages/x86_64/daos_packages.repo use_iap: true @@ -82,24 +81,29 @@ deployment_groups: use_internal_ip: true omit_external_ip: false daos_install_type: client - image_family: $(vars.client_image_family) + image_family: $(vars.daos_client_image_family) - group: daos-cluster modules: - # more info: https://github.com/daos-stack/google-cloud-daos/tree/v0.4.1/terraform/modules/daos_server + # more info: https://github.com/daos-stack/google-cloud-daos/tree/develop/terraform/modules/daos_server - id: daos-server - source: github.com/daos-stack/google-cloud-daos.git//terraform/modules/daos_server?ref=v0.4.1&depth=1 + # source: $(vars.daos_server_module_source_url) + source: "github.com/daos-stack/google-cloud-daos//terraform/modules/daos_server?ref=v0.5.0&depth=1" use: [network1] settings: number_of_instances: 2 labels: {ghpc_role: file-system} - os_family: $(vars.server_image_family) + os_family: $(vars.daos_server_image_family) + daos_scm_size: "172" + tags: $(vars.tags) - # more info: https://github.com/daos-stack/google-cloud-daos/tree/v0.4.1/terraform/modules/daos_client + # more info: https://github.com/daos-stack/google-cloud-daos/tree/develop/terraform/modules/daos_client - id: daos-client - source: github.com/daos-stack/google-cloud-daos.git//terraform/modules/daos_client?ref=v0.4.1&depth=1 + # source: $(vars.daos_client_module_source_url) + source: "github.com/daos-stack/google-cloud-daos//terraform/modules/daos_client?ref=v0.5.0&depth=1" use: [network1, daos-server] settings: number_of_instances: 2 labels: {ghpc_role: compute} - os_family: $(vars.client_image_family) + 
os_family: $(vars.daos_client_image_family) + tags: $(vars.tags) From 021d0e2b808c5bffcb5d9899e8f04bb98fe9ed66 Mon Sep 17 00:00:00 2001 From: Mark Olson <115657904+mark-olson@users.noreply.github.com> Date: Mon, 22 Jan 2024 09:43:03 -0800 Subject: [PATCH 2/4] Update README.md with fixes from review. Signed-off-by: Mark Olson <115657904+mark-olson@users.noreply.github.com> --- community/examples/intel/README.md | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/community/examples/intel/README.md b/community/examples/intel/README.md index 8b0f4072d8..961000ac83 100644 --- a/community/examples/intel/README.md +++ b/community/examples/intel/README.md @@ -204,7 +204,7 @@ ls -lh "${HOME}/daos/cont1" A work-around for this issue to disable caching when mounting the container. -``` +```bash dfuse --singlethread --disable-caching --pool=pool1 --container=cont1 --mountpoint="${HOME}/daos/cont1" ``` @@ -234,7 +234,7 @@ See the [DFuse (DAOS FUSE)](https://docs.daos.io/v2.4/user/filesystem/?h=dfuse#d Delete the remaining infrastructure -```shell +```bash ghpc destroy pfs-daos --auto-approve ``` @@ -436,16 +436,15 @@ See the [DFuse (DAOS FUSE)](https://docs.daos.io/v2.4/user/filesystem/?h=dfuse#d ### Delete the DAOS/Slurm Cluster infrastructure when not in use -> **_NOTE:_** All data on the DAOS file system will be permanently lost after cluster deletion. - -> **_NOTE:_** If the Slurm controller is shut down before the auto-scale instances -> are destroyed those instances will be left running. +> **Note:** +> - Data on the DAOS file system will be permanently lost after cluster deletion. +> - If the Slurm controller is shut down before the auto-scale instances are destroyed, those compute instances will be left running. Open your browser to the VM instances page and ensure that instances named "compute" have been shutdown and deleted by the Slurm autoscaler. Delete the remaining infrastructure: -```shell +```bash ghpc destroy hpc-slurm-daos --auto-approve ``` From f9ab5c9d2840c6e11bbaa916e9f5e620ac07e485 Mon Sep 17 00:00:00 2001 From: Mark Olson <115657904+mark-olson@users.noreply.github.com> Date: Tue, 23 Jan 2024 09:35:01 -0800 Subject: [PATCH 3/4] Fixed PyMarkdown issue in community/examples/intel/README.md Signed-off-by: Mark Olson <115657904+mark-olson@users.noreply.github.com> --- community/examples/intel/README.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/community/examples/intel/README.md b/community/examples/intel/README.md index 961000ac83..932f9e3b52 100644 --- a/community/examples/intel/README.md +++ b/community/examples/intel/README.md @@ -436,7 +436,8 @@ See the [DFuse (DAOS FUSE)](https://docs.daos.io/v2.4/user/filesystem/?h=dfuse#d ### Delete the DAOS/Slurm Cluster infrastructure when not in use -> **Note:** +> **_NOTE:_** +> > - Data on the DAOS file system will be permanently lost after cluster deletion. > - If the Slurm controller is shut down before the auto-scale instances are destroyed, those compute instances will be left running. 
From 31db3c4158849bd94d40adb891f7b7acd9198af4 Mon Sep 17 00:00:00 2001
From: Mark Olson <115657904+mark-olson@users.noreply.github.com>
Date: Tue, 23 Jan 2024 09:41:29 -0800
Subject: [PATCH 4/4] Fixed typos in community/examples/intel/README.md

Signed-off-by: Mark Olson <115657904+mark-olson@users.noreply.github.com>
---
 community/examples/intel/README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/community/examples/intel/README.md b/community/examples/intel/README.md
index 932f9e3b52..0679b5fb01 100644
--- a/community/examples/intel/README.md
+++ b/community/examples/intel/README.md
@@ -149,7 +149,7 @@ Set ACLs to allow any user to create a container in *pool1*.
 sudo dmg pool update-acl -e A::EVERYONE@:rcta pool1
 ```
-See the [Pool Operations](https://docs.daos.io/v2.4/admin/pool_operations) section of the of the DAOS Administration Guide for more information about creating pools.
+See the [Pool Operations](https://docs.daos.io/v2.4/admin/pool_operations) section of the DAOS Administration Guide for more information about creating pools.
 #### Create a Container
@@ -162,7 +162,7 @@ For the purpose of this demo create the container without specifying ACLs. The c
 daos container create --type=POSIX --properties=rf:0 pool1 cont1
 ```
-See the [Container Management](https://docs.daos.io/v2.4/user/container) section of the of the DAOS User Guide for more information about creating containers.
+See the [Container Management](https://docs.daos.io/v2.4/user/container) section of the DAOS User Guide for more information about creating containers.
 #### Mount the DAOS Container
@@ -212,7 +212,7 @@ See the [File System](https://docs.daos.io/v2.4/user/filesystem/) section of the
 ### Unmount the DAOS Container
-The container will need to by unmounted before you log out. If this is not done it can leave open file handles and prevent the container from being mounted when you log in again.
+The container will need to be unmounted before you log out. If this is not done it can leave open file handles and prevent the container from being mounted when you log in again.
 Verify that the container is unmounted