Add A3-Mega support (#371)
Adds support for the `a3-megagpu-8g` SKU for GKE and MIG creation.
Chris113113 authored May 3, 2024
2 parents 6ef47c4 + b80910d commit 6c779c9
Showing 104 changed files with 5,645 additions and 5 deletions.
4 changes: 3 additions & 1 deletion Dockerfile
@@ -16,7 +16,7 @@ RUN curl -s "https://releases.hashicorp.com/terraform/${TERRAFORM_VERSION}/terra
&& mv ./terraform /root/.local/bin/terraform
COPY ./a3/terraform ./a3/terraform
COPY ./a2/terraform ./a2/terraform

COPY ./a3-mega/terraform ./a3-mega/terraform

FROM base as test
COPY test ./test
@@ -32,6 +32,8 @@ ENTRYPOINT ["./test/continuous/run.sh"]


FROM base as deploy
RUN for cluster in gke mig mig-cos; do \
terraform -chdir="./a3-mega/terraform/modules/cluster/${cluster}" init; done
RUN for cluster in gke gke-beta mig mig-cos slurm; do \
terraform -chdir="./a3/terraform/modules/cluster/${cluster}" init; done
RUN for cluster in mig; do \
2 changes: 1 addition & 1 deletion README.md
@@ -126,7 +126,7 @@ the same as any other terraform:
# assuming the directory containing main.tf is the current working directory

# create/update the cluster
-terraform init && terraform validate && terraform apply
+terraform init && terraform validate && terraform apply -var-file="terraform.tfvars"

# destroy the cluster
terraform init && terraform validate && terraform apply -destroy
67 changes: 67 additions & 0 deletions a3-mega/README.md
@@ -0,0 +1,67 @@
# Overview

## Control Plane Options

A3-Mega clusters may be created through either [GKE](https://cloud.google.com/kubernetes-engine) or a [MIG](https://cloud.google.com/compute/docs/instance-groups#managed_instance_groups) via the modules found [here](./terraform/modules/cluster). Because A3-Mega was released only recently, each control plane supports a limited feature set, summarized in the table below.

| Feature \ Module | `gke` | `mig-cos` |
| --- | --- | --- |
| [VM Image](https://cloud.google.com/compute/docs/images) | [COS-Cloud](https://cloud.google.com/container-optimized-os/docs) | [COS-Cloud](https://cloud.google.com/container-optimized-os/docs) |
| [Compact placement policy](https://cloud.google.com/compute/docs/instances/define-instance-placement) | Yes | Yes |
| [Kubernetes](https://kubernetes.io/) support | Yes | No |

## Quickstart with `gke`

An A3-Mega cluster of eight nodes (two node pools with four nodes each) booting with a COS-Cloud image can be created via GKE by running the following two commands:

```bash
cat >./terraform.tfvars <<EOF
project_id = "my-project"
region = "us-central1"
resource_prefix = "my-cluster"
node_pools = [
  {
    zone = "us-central1-c"
    node_count = 4
  },
  {
    zone = "us-central1-c"
    node_count = 4
  },
]
EOF

docker run --rm -v "${PWD}:/root/aiinfra/input" \
us-docker.pkg.dev/gce-ai-infra/cluster-provision-dev/cluster-provision-image:latest \
create a3-mega gke
```

A deeper dive into how to use this tool can be found at the [top-level README](../README.md#how-to-provision-a-cluster).

## Quickstart with `mig-cos`

An A3-Mega cluster of eight nodes (two instance groups with four instances each) booting with a COS-Cloud image can be created via a managed instance group by running the following two commands:

```bash
cat >./terraform.tfvars <<EOF
instance_groups = [
  {
    target_size = 4
    zone = "us-central1-c"
  },
  {
    target_size = 4
    zone = "us-central1-c"
  },
]
project_id = "my-project"
region = "us-central1"
resource_prefix = "my-cluster"
EOF

docker run --rm -v "${PWD}:/root/aiinfra/input" \
us-docker.pkg.dev/gce-ai-infra/cluster-provision-dev/cluster-provision-image:latest \
create a3-mega mig-cos
```

A deeper dive into how to use this tool can be found at the [top-level README](../README.md#how-to-provision-a-cluster).
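
The two quickstarts differ only in the contents of `terraform.tfvars` and in the final argument passed to the image. A minimal sketch of that pattern follows. `print_provision_cmd` is a hypothetical helper (not part of this repository) that only prints the command rather than running it; the image path and the `create a3-mega <module>` arguments are copied from the quickstarts above.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print (do not run) the provisioning command for one
# of the two control planes documented in this README.
print_provision_cmd() {
  local module="$1"
  case "${module}" in
    gke|mig-cos) ;;
    *)
      echo "unsupported module: ${module} (expected gke or mig-cos)" >&2
      return 1
      ;;
  esac
  # The current directory is mounted as the input directory, so it must
  # contain the terraform.tfvars file for the chosen module.
  printf 'docker run --rm -v "%s:/root/aiinfra/input" %s create a3-mega %s\n' \
    "${PWD}" \
    "us-docker.pkg.dev/gce-ai-infra/cluster-provision-dev/cluster-provision-image:latest" \
    "${module}"
}

print_provision_cmd gke
print_provision_cmd mig-cos
```

Printing the command first makes it easy to review which directory is mounted before any resources are created.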
62 changes: 62 additions & 0 deletions a3-mega/examples/gke/README.md
@@ -0,0 +1,62 @@
# The cluster

<!-- TODO pirillo@: update to A3-mega blog -->
This configuration creates a GKE cluster with two node pools of four
[`a3-megagpu-8g`](https://cloud.google.com/blog/products/compute/introducing-a3-supercomputers-with-nvidia-h100-gpus)
VM instances each (eight instances in total), along with a Kubernetes service
account. Each instance has:
- eight [NVIDIA H100 GPUs](https://www.nvidia.com/en-us/data-center/h100/),
- nine [NICs](https://cloud.google.com/vpc/docs/multiple-interfaces-concepts)
  (one for the host network and eight dedicated to the GPUs),
- a [COS-Cloud](https://cloud.google.com/container-optimized-os/docs) machine
  image,
- TCPX, NVIDIA GPU drivers, and the NCCL plugin installed.

# The tfvars file

The `terraform.tfvars` file is what configures the cluster. Detailed
descriptions of each variable can be found in
[this `README`](../../terraform/modules/cluster/gke/README.md).
All optional variables may be omitted to use their default values.

Required variables:
- `project_id`
- `resource_prefix`
- `region`
- `node_pools`

# How to create this cluster

Refer to [this section](../../../README.md#how-to-provision-a-cluster).
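
As a minimal sketch of that workflow, the snippet below writes a `terraform.tfvars` file with the four required variables and checks that none is missing before handing off to Terraform. The values are the placeholders used in the A3-Mega quickstart, and `check_required_vars` is a hypothetical helper, not part of this repository.

```bash
#!/usr/bin/env bash
set -eu

# Placeholder values copied from the quickstart; replace them with your own.
TFVARS="${TFVARS:-./terraform.tfvars}"
cat >"${TFVARS}" <<EOF
project_id      = "my-project"
region          = "us-central1"
resource_prefix = "my-cluster"
node_pools = [
  {
    zone       = "us-central1-c"
    node_count = 4
  },
  {
    zone       = "us-central1-c"
    node_count = 4
  },
]
EOF

# Hypothetical helper: succeed only if every named variable is assigned
# at the start of some line in the tfvars file.
check_required_vars() {
  local tfvars="$1"
  shift
  local var
  for var in "$@"; do
    if ! grep -Eq "^[[:space:]]*${var}[[:space:]]*=" "${tfvars}"; then
      echo "missing required variable: ${var}" >&2
      return 1
    fi
  done
}

check_required_vars "${TFVARS}" project_id resource_prefix region node_pools

# With main.tf in the current working directory, the cluster can then be created:
#   terraform init && terraform validate && terraform apply -var-file="terraform.tfvars"
```

The check is only a textual guard against an incomplete file; `terraform validate` remains the authoritative check of the configuration.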

<!-- BEGINNING OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
## Requirements

No requirements.

## Providers

No providers.

## Modules

| Name | Source | Version |
|------|--------|---------|
| <a name="module_a3-mega-gke"></a> [a3-mega-gke](#module\_a3-mega-gke) | github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning//a3-mega/terraform/modules/cluster/gke | n/a |

## Resources

No resources.

## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_node_pools"></a> [node\_pools](#input\_node\_pools) | n/a | `any` | n/a | yes |
| <a name="input_project_id"></a> [project\_id](#input\_project\_id) | n/a | `any` | n/a | yes |
| <a name="input_region"></a> [region](#input\_region) | n/a | `any` | n/a | yes |
| <a name="input_resource_prefix"></a> [resource\_prefix](#input\_resource\_prefix) | n/a | `any` | n/a | yes |

## Outputs

No outputs.
<!-- END OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
35 changes: 35 additions & 0 deletions a3-mega/examples/gke/blueprint.yaml
@@ -0,0 +1,35 @@
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

---

blueprint_name: a3-mega-gke

vars:
  deployment_name: a3-mega-gke

  node_pools:
  - node_count: 4
    zone: us-east4-a
  - node_count: 4
    zone: us-east4-a
  project_id: my-project-id
  region: us-east4
  resource_prefix: my-cluster-name

deployment_groups:
- group: primary
  modules:
  - id: a3-mega-gke
    source: "github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning//a3-mega/terraform/modules/cluster/gke"
13 changes: 13 additions & 0 deletions a3-mega/examples/gke/main.tf
@@ -0,0 +1,13 @@
variable "node_pools" {}
variable "project_id" {}
variable "resource_prefix" {}
variable "region" {}

module "a3-mega-gke" {
  source = "github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning//a3-mega/terraform/modules/cluster/gke"

  node_pools      = var.node_pools
  project_id      = var.project_id
  resource_prefix = var.resource_prefix
  region          = var.region
}
61 changes: 61 additions & 0 deletions a3-mega/examples/mig-cos/README.md
@@ -0,0 +1,61 @@
# The cluster

This configuration creates two Managed Instance Groups of four
[`a3-megagpu-8g`](https://cloud.google.com/blog/products/compute/introducing-a3-supercomputers-with-nvidia-h100-gpus)
VM instances each (eight instances in total). Each instance has:
- eight [NVIDIA H100 GPUs](https://www.nvidia.com/en-us/data-center/h100/),
- nine [NICs](https://cloud.google.com/vpc/docs/multiple-interfaces-concepts)
  (one for the host network and eight dedicated to the GPUs),
- a [COS-Cloud](https://cloud.google.com/container-optimized-os/docs) machine
  image,
- TCPX, NVIDIA GPU drivers, and the NCCL plugin installed.

# The tfvars file

The `terraform.tfvars` file is what configures the cluster. Detailed
descriptions of each variable can be found in
[this `README`](../../terraform/modules/cluster/mig-cos/README.md).
All optional variables may be omitted to use their default values.

Required variables:
- `instance_groups`
- `project_id`
- `region`
- `resource_prefix`

# How to create this cluster

Refer to [this section](../../../README.md#how-to-provision-a-cluster).

<!-- BEGINNING OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
## Requirements

No requirements.

## Providers

No providers.

## Modules

| Name | Source | Version |
|------|--------|---------|
| <a name="module_a3-mig-cos"></a> [a3-mig-cos](#module\_a3-mig-cos) | github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning//a3-mega/terraform/modules/cluster/mig-cos | n/a |

## Resources

No resources.

## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_instance_groups"></a> [instance\_groups](#input\_instance\_groups) | n/a | `any` | n/a | yes |
| <a name="input_project_id"></a> [project\_id](#input\_project\_id) | n/a | `any` | n/a | yes |
| <a name="input_region"></a> [region](#input\_region) | n/a | `any` | n/a | yes |
| <a name="input_resource_prefix"></a> [resource\_prefix](#input\_resource\_prefix) | n/a | `any` | n/a | yes |

## Outputs

No outputs.
<!-- END OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
35 changes: 35 additions & 0 deletions a3-mega/examples/mig-cos/blueprint.yaml
@@ -0,0 +1,35 @@
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

---

blueprint_name: a3-mega-mig-cos

vars:
  deployment_name: a3-mega-mig-cos

  instance_groups:
  - target_size: 4
    zone: us-east4-a
  - target_size: 4
    zone: us-east4-a
  project_id: my-project-id
  region: us-east4
  resource_prefix: my-cluster-name

deployment_groups:
- group: primary
  modules:
  - id: a3-mega-mig-cos
    source: "github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning//a3-mega/terraform/modules/cluster/mig-cos"
13 changes: 13 additions & 0 deletions a3-mega/examples/mig-cos/main.tf
@@ -0,0 +1,13 @@
variable "instance_groups" {}
variable "project_id" {}
variable "region" {}
variable "resource_prefix" {}

module "a3-mig-cos" {
  source = "github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning//a3-mega/terraform/modules/cluster/mig-cos"

  instance_groups = var.instance_groups
  project_id      = var.project_id
  region          = var.region
  resource_prefix = var.resource_prefix
}
72 changes: 72 additions & 0 deletions a3-mega/terraform/modules/cluster/gke/README.md
@@ -0,0 +1,72 @@
<!-- BEGINNING OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
Copyright 2022 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

## Requirements

No requirements.

## Providers

| Name | Version |
|------|---------|
| <a name="provider_google"></a> [google](#provider\_google) | n/a |
| <a name="provider_google-beta"></a> [google-beta](#provider\_google-beta) | n/a |

## Modules

| Name | Source | Version |
|------|--------|---------|
| <a name="module_dashboard"></a> [dashboard](#module\_dashboard) | ../../common/dashboard | n/a |
| <a name="module_kubectl-apply"></a> [kubectl-apply](#module\_kubectl-apply) | ./kubectl-apply | n/a |
| <a name="module_network"></a> [network](#module\_network) | ../../common/network | n/a |
| <a name="module_resource_policy"></a> [resource\_policy](#module\_resource\_policy) | ../../common/resource_policy | n/a |

## Resources

| Name | Type |
|------|------|
| [google-beta_google_container_cluster.cluster](https://registry.terraform.io/providers/hashicorp/google-beta/latest/docs/resources/google_container_cluster) | resource |
| [google-beta_google_container_node_pool.node-pools](https://registry.terraform.io/providers/hashicorp/google-beta/latest/docs/resources/google_container_node_pool) | resource |
| [google_project_iam_member.node_service_account_logWriter](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/project_iam_member) | resource |
| [google_project_iam_member.node_service_account_metricWriter](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/project_iam_member) | resource |
| [google_project_iam_member.node_service_account_monitoringViewer](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/project_iam_member) | resource |
| [google_client_config.current](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/client_config) | data source |
| [google_compute_default_service_account.account](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/compute_default_service_account) | data source |
| [google_container_engine_versions.gkeversion](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/container_engine_versions) | data source |

## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_disk_size_gb"></a> [disk\_size\_gb](#input\_disk\_size\_gb) | Size of the disk attached to each node, specified in GB. The smallest allowed disk size is 10GB. Defaults to 200GB.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster#disk_size_gb), [gcloud](https://cloud.google.com/sdk/gcloud/reference/container/clusters/create#--disk-size). | `number` | `200` | no |
| <a name="input_disk_type"></a> [disk\_type](#input\_disk\_type) | Type of the disk attached to each node. The default disk type is 'pd-ssd'.<br><br>Possible values: `["pd-ssd", "local-ssd", "pd-balanced", "pd-standard"]`<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster#disk_type), [gcloud](https://cloud.google.com/sdk/gcloud/reference/container/clusters/create#--disk-type). | `string` | `"pd-ssd"` | no |
| <a name="input_enable_gke_dashboard"></a> [enable\_gke\_dashboard](#input\_enable\_gke\_dashboard) | Flag to enable GPU usage dashboards for the GKE cluster. | `bool` | `true` | no |
| <a name="input_gke_version"></a> [gke\_version](#input\_gke\_version) | The GKE version to be used as the minimum version of the master. Defaults to the latest master version.<br>More details can be found [here](https://cloud.google.com/kubernetes-engine/versioning#specifying_cluster_version).<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster#name), [gcloud](https://cloud.google.com/sdk/gcloud/reference/container/clusters/create#--name). | `string` | `null` | no |
| <a name="input_host_maintenance_interval"></a> [host\_maintenance\_interval](#input\_host\_maintenance\_interval) | Specifies the frequency of planned maintenance events. 'PERIODIC' is the only supported value for host\_maintenance\_interval. This enables the use of stable fleet VMs. | `string` | `"PERIODIC"` | no |
| <a name="input_ksa"></a> [ksa](#input\_ksa) | The configuration for setting up Kubernetes Service Account (KSA) after GKE<br>cluster is created. Disable by setting to null.<br><br>- `name`: The KSA name to be used for Pods<br>- `namespace`: The KSA namespace to be used for Pods<br><br>Related Docs: [Workload Identity](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity) | <pre>object({<br> name = string<br> namespace = string<br> })</pre> | <pre>{<br> "name": "aiinfra-gke-sa",<br> "namespace": "default"<br>}</pre> | no |
| <a name="input_network_existing"></a> [network\_existing](#input\_network\_existing) | Existing network to attach to nic0. Setting to null will create a new network for it. | <pre>object({<br> network_name = string<br> subnetwork_name = string<br> })</pre> | `null` | no |
| <a name="input_node_pools"></a> [node\_pools](#input\_node\_pools) | The list of node pools for the GKE cluster.<br>- `zone`: The zone in which the node pool's nodes should be located. Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_node_pool.html#node_locations)<br>- `node_count`: The number of nodes per node pool. This field can be used to update the number of nodes per node pool. Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_node_pool.html#node_count)<br>- `machine_type`: (Optional) The machine type for the node pool. The only supported machine types are 'a3-highgpu-8g' and 'a2-highgpu-1g'. Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster#machine_type)<br>- `compact_placement_policy`: (Optional) The object for superblock level compact placement policy for the instances. Currently only 1 resource policy is supported. Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_node_pool.html#policy_name)<br> - `new_policy`: (Optional) Flag for creating a new resource policy.<br> - `existing_policy_name`: (Optional) The existing resource policy. | <pre>list(object({<br> zone = string,<br> node_count = number,<br> machine_type = optional(string, "a3-highgpu-8g"),<br> compact_placement_policy = optional(object({<br> new_policy = optional(bool, false)<br> existing_policy_name = optional(string)<br> specific_reservation = optional(string)<br> }))<br> }))</pre> | `[]` | no |
| <a name="input_node_service_account"></a> [node\_service\_account](#input\_node\_service\_account) | The service account to be used by the Node VMs. If not specified, the "default" service account is used.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster#nested_node_config), [gcloud](https://cloud.google.com/sdk/gcloud/reference/container/clusters/create#--service-account). | `string` | `null` | no |
| <a name="input_project_id"></a> [project\_id](#input\_project\_id) | GCP Project ID to which the cluster will be deployed. | `string` | n/a | yes |
| <a name="input_region"></a> [region](#input\_region) | The region in which the cluster master will be created. The cluster will be a regional cluster with multiple masters spread across zones in the region, and with default node locations in those zones as well. | `string` | n/a | yes |
| <a name="input_resource_prefix"></a> [resource\_prefix](#input\_resource\_prefix) | Arbitrary string with which all names of newly created resources will be prefixed. | `string` | n/a | yes |

## Outputs

| Name | Description |
|------|-------------|
| <a name="output_id"></a> [id](#output\_id) | Google Kubernetes cluster id |
| <a name="output_name"></a> [name](#output\_name) | Google Kubernetes cluster name |
<!-- END OF PRE-COMMIT-TERRAFORM DOCS HOOK -->