Adds support for the A3-Megagpu-8g SKU for GKE and MIG creation.
Showing 104 changed files with 5,645 additions and 5 deletions.
@@ -0,0 +1,67 @@
# Overview

## Control Plane Options

A3-Mega clusters may be created through either [GKE](https://cloud.google.com/kubernetes-engine) or a [MIG](https://cloud.google.com/compute/docs/instance-groups#managed_instance_groups) via the modules found [here](./terraform/modules/cluster). Because A3-Mega was released only recently, each control plane supports a limited set of features, which are compared below.

| Feature \ Module | `gke` | `mig-cos` |
| --- | --- | --- |
| [VM Image](https://cloud.google.com/compute/docs/images) | [COS-Cloud](https://cloud.google.com/container-optimized-os/docs) | [COS-Cloud](https://cloud.google.com/container-optimized-os/docs) |
| [Compact placement policy](https://cloud.google.com/compute/docs/instances/define-instance-placement) | Yes | Yes |
| [Kubernetes](https://kubernetes.io/) support | Yes | No |

## Quickstart with `gke`

An A3-Mega cluster of eight nodes (two node pools with four nodes each) booting with a COS-Cloud image can be created via GKE by running the following two commands:

```bash
cat >./terraform.tfvars <<EOF
project_id      = "my-project"
region          = "us-central1"
resource_prefix = "my-cluster"
node_pools = [
  {
    zone       = "us-central1-c"
    node_count = 4
  },
  {
    zone       = "us-central1-c"
    node_count = 4
  },
]
EOF

docker run --rm -v "${PWD}:/root/aiinfra/input" \
  us-docker.pkg.dev/gce-ai-infra/cluster-provision-dev/cluster-provision-image:latest \
  create a3-mega gke
```

A deeper dive into how to use this tool can be found at the [top-level README](../README.md#how-to-provision-a-cluster).

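Before launching the container, it can help to confirm that `terraform.tfvars` really describes eight nodes. The following is a minimal sketch and not part of the provisioning tool: `sum_field` is a hypothetical helper that adds up every `node_count = N` line with plain `awk`, and it assumes each count sits on its own line, as in the quickstart above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the provisioning tool): sums every
# "<field> = <number>" line found in a tfvars file.
sum_field() {
  local field="$1" file="$2"
  awk -v f="$field" '$1 == f && $2 == "=" { total += $3 } END { print total + 0 }' "$file"
}

# Write the node_pools layout from the quickstart to a scratch file.
tfvars="$(mktemp)"
cat >"$tfvars" <<EOF
project_id      = "my-project"
region          = "us-central1"
resource_prefix = "my-cluster"
node_pools = [
  {
    zone       = "us-central1-c"
    node_count = 4
  },
  {
    zone       = "us-central1-c"
    node_count = 4
  },
]
EOF

total_nodes="$(sum_field node_count "$tfvars")"
echo "total nodes: ${total_nodes}"  # prints: total nodes: 8
rm -f "$tfvars"
```

The same helper could be pointed at `target_size` for a `mig-cos` configuration, since it only matches on the field name.
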
## Quickstart with `mig-cos`

An A3-Mega cluster of eight nodes (two instance groups with four instances each) booting with a COS-Cloud image can be created via a managed instance group by running the following two commands:

```bash
cat >./terraform.tfvars <<EOF
instance_groups = [
  {
    target_size = 4
    zone        = "us-central1-c"
  },
  {
    target_size = 4
    zone        = "us-central1-c"
  },
]
project_id      = "my-project"
region          = "us-central1"
resource_prefix = "my-cluster"
EOF

docker run --rm -v "${PWD}:/root/aiinfra/input" \
  us-docker.pkg.dev/gce-ai-infra/cluster-provision-dev/cluster-provision-image:latest \
  create a3-mega mig-cos
```

A deeper dive into how to use this tool can be found at the [top-level README](../README.md#how-to-provision-a-cluster).
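
The two quickstarts differ only in the final argument passed to the container. As a rough sketch, the wrapper below prints the `docker` command for a chosen control plane instead of running it, so the invocation can be reviewed first. `provision_cmd` and the `IMAGE` variable are hypothetical names introduced here; the image path and the `create a3-mega <control-plane>` arguments are copied from the quickstarts.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: prints the docker command for a control plane
# rather than executing it.
IMAGE="us-docker.pkg.dev/gce-ai-infra/cluster-provision-dev/cluster-provision-image:latest"

provision_cmd() {
  local control_plane="$1"
  case "$control_plane" in
    gke | mig-cos) ;;
    *)
      echo "unsupported control plane: ${control_plane}" >&2
      return 1
      ;;
  esac
  printf 'docker run --rm -v "%s:/root/aiinfra/input" %s create a3-mega %s\n' \
    "$PWD" "$IMAGE" "$control_plane"
}

provision_cmd mig-cos
```

Rejecting anything other than `gke` or `mig-cos` up front reflects the two modules listed in the table above; whether the container itself accepts other values is not covered here.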
@@ -0,0 +1,62 @@
# The cluster

<!-- TODO pirillo@: update to A3-mega blog -->

This configuration creates a GKE cluster with a Kubernetes service account and
two node pools of four
[`a3-megagpu-8g`](https://cloud.google.com/blog/products/compute/introducing-a3-supercomputers-with-nvidia-h100-gpus)
VM instances each (eight instances in total). Each instance has:
- eight [NVIDIA H100 GPUs](https://www.nvidia.com/en-us/data-center/h100/),
- nine [NICs](https://cloud.google.com/vpc/docs/multiple-interfaces-concepts)
  (one for the host network and eight dedicated to the GPUs),
- a [COS-Cloud](https://cloud.google.com/container-optimized-os/docs) machine
  image,
- TCPX, NVIDIA GPU drivers, and the NCCL plugin installed.

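The cluster-wide totals implied by that description can be worked out with a few lines of shell arithmetic. This is only an illustrative calculation; the variable names are made up here and the per-node figures are taken from the list above.

```bash
#!/usr/bin/env bash
# Figures taken from the cluster description above.
node_pools=2
nodes_per_pool=4
gpus_per_node=8
nics_per_node=9

total_nodes=$((node_pools * nodes_per_pool))
total_gpus=$((total_nodes * gpus_per_node))
total_nics=$((total_nodes * nics_per_node))
# One NIC per node serves the host network; the rest are dedicated to GPUs.
gpu_nics=$((total_nodes * (nics_per_node - 1)))

echo "nodes=${total_nodes} gpus=${total_gpus} nics=${total_nics} gpu_nics=${gpu_nics}"
# prints: nodes=8 gpus=64 nics=72 gpu_nics=64
```
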
# The tfvars file

The `terraform.tfvars` file is what configures the cluster. Detailed
descriptions of each variable can be found in
[this `README`](../../terraform/modules/cluster/gke/README.md).
All optional variables may be omitted to use their default values.

Required variables:
- `project_id`
- `resource_prefix`
- `region`
- `node_pools`

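A quick way to catch a missing required variable before provisioning is to scan the file for each name. The snippet below is a sketch under simple assumptions: `check_required` is a hypothetical helper, and it only looks for `name =` at the start of a line rather than parsing HCL.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if every named variable is assigned
# at the start of some line in the given tfvars file.
check_required() {
  local file="$1" missing=0 name
  shift
  for name in "$@"; do
    if ! grep -Eq "^[[:space:]]*${name}[[:space:]]*=" "$file"; then
      echo "missing required variable: ${name}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Scratch file that deliberately omits node_pools.
tfvars="$(mktemp)"
cat >"$tfvars" <<EOF
project_id      = "my-project"
region          = "us-central1"
resource_prefix = "my-cluster"
EOF

if check_required "$tfvars" project_id resource_prefix region node_pools; then
  result="complete"
else
  result="incomplete"
fi
echo "tfvars is ${result}"  # prints: tfvars is incomplete
rm -f "$tfvars"
```

Because the variable names are passed as arguments, the same helper could be reused for other configurations by swapping in their required names.
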
# How to create this cluster

Refer to [this section](../../../README.md#how-to-provision-a-cluster).

<!-- BEGINNING OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
## Requirements

No requirements.

## Providers

No providers.

## Modules

| Name | Source | Version |
|------|--------|---------|
| <a name="module_a3-mega-gke"></a> [a3-mega-gke](#module\_a3-mega-gke) | github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning//a3-mega/terraform/modules/cluster/gke | n/a |

## Resources

No resources.

## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_node_pools"></a> [node\_pools](#input\_node\_pools) | n/a | `any` | n/a | yes |
| <a name="input_project_id"></a> [project\_id](#input\_project\_id) | n/a | `any` | n/a | yes |
| <a name="input_region"></a> [region](#input\_region) | n/a | `any` | n/a | yes |
| <a name="input_resource_prefix"></a> [resource\_prefix](#input\_resource\_prefix) | n/a | `any` | n/a | yes |

## Outputs

No outputs.
<!-- END OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
@@ -0,0 +1,35 @@
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

---

blueprint_name: a3-mega-gke

vars:
  deployment_name: a3-mega-gke

  node_pools:
  - node_count: 4
    zone: us-east4-a
  - node_count: 4
    zone: us-east4-a
  project_id: my-project-id
  region: us-east4
  resource_prefix: my-cluster-name

deployment_groups:
- group: primary
  modules:
  - id: a3-mega-gke
    source: "github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning//a3-mega/terraform/modules/cluster/gke"
@@ -0,0 +1,13 @@
variable "node_pools" {}
variable "project_id" {}
variable "resource_prefix" {}
variable "region" {}

module "a3-mega-gke" {
  source = "github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning//a3-mega/terraform/modules/cluster/gke"

  node_pools      = var.node_pools
  project_id      = var.project_id
  resource_prefix = var.resource_prefix
  region          = var.region
}
@@ -0,0 +1,61 @@
# The cluster

This configuration creates two Managed Instance Groups of four
[`a3-megagpu-8g`](https://cloud.google.com/blog/products/compute/introducing-a3-supercomputers-with-nvidia-h100-gpus)
VM instances each (eight instances in total). Each instance has:
- eight [NVIDIA H100 GPUs](https://www.nvidia.com/en-us/data-center/h100/),
- nine [NICs](https://cloud.google.com/vpc/docs/multiple-interfaces-concepts)
  (one for the host network and eight dedicated to the GPUs),
- a [COS-Cloud](https://cloud.google.com/container-optimized-os/docs) machine
  image,
- TCPX, NVIDIA GPU drivers, and the NCCL plugin installed.

# The tfvars file

The `terraform.tfvars` file is what configures the cluster. Detailed
descriptions of each variable can be found in
[this `README`](../../terraform/modules/cluster/mig-cos/README.md).
All optional variables may be omitted to use their default values.

Required variables:
- `instance_groups`
- `project_id`
- `region`
- `resource_prefix`

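When the number of instance groups varies, writing the file by hand gets repetitive. As a sketch, the hypothetical `render_tfvars` function below prints a file containing the four required variables, with one instance group per size argument. It assumes every group lives in the same zone, which is a simplification made here rather than a requirement of the module.

```bash
#!/usr/bin/env bash
# Hypothetical generator: prints a mig-cos terraform.tfvars with one
# instance group per size argument, all placed in the same zone.
render_tfvars() {
  local project_id="$1" region="$2" zone="$3" prefix="$4" size
  shift 4
  echo "instance_groups = ["
  for size in "$@"; do
    printf '  {\n    target_size = %s\n    zone        = "%s"\n  },\n' "$size" "$zone"
  done
  echo "]"
  printf 'project_id      = "%s"\n' "$project_id"
  printf 'region          = "%s"\n' "$region"
  printf 'resource_prefix = "%s"\n' "$prefix"
}

# Two groups of four instances, matching the example cluster.
render_tfvars my-project us-central1 us-central1-c my-cluster 4 4
```

Redirecting the output to `./terraform.tfvars` would produce a file with the same values as the `mig-cos` quickstart.
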
# How to create this cluster

Refer to [this section](../../../README.md#how-to-provision-a-cluster).

<!-- BEGINNING OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
## Requirements

No requirements.

## Providers

No providers.

## Modules

| Name | Source | Version |
|------|--------|---------|
| <a name="module_a3-mega-mig-cos"></a> [a3-mig-cos](#module\_a3-mega-mig-cos) | github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning//a3-mega/terraform/modules/cluster/mig-cos | n/a |

## Resources

No resources.

## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_instance_groups"></a> [instance\_groups](#input\_instance\_groups) | n/a | `any` | n/a | yes |
| <a name="input_project_id"></a> [project\_id](#input\_project\_id) | n/a | `any` | n/a | yes |
| <a name="input_region"></a> [region](#input\_region) | n/a | `any` | n/a | yes |
| <a name="input_resource_prefix"></a> [resource\_prefix](#input\_resource\_prefix) | n/a | `any` | n/a | yes |

## Outputs

No outputs.
<!-- END OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
@@ -0,0 +1,35 @@
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

---

blueprint_name: a3-mega-mig-cos

vars:
  deployment_name: a3-mega-mig-cos

  instance_groups:
  - target_size: 4
    zone: us-east4-a
  - target_size: 4
    zone: us-east4-a
  project_id: my-project-id
  region: us-east4
  resource_prefix: my-cluster-name

deployment_groups:
- group: primary
  modules:
  - id: a3-mega-mig-cos
    source: "github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning//a3-mega/terraform/modules/cluster/mig-cos"
@@ -0,0 +1,13 @@
variable "instance_groups" {}
variable "project_id" {}
variable "region" {}
variable "resource_prefix" {}

module "a3-mig-cos" {
  source = "github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning//a3-mega/terraform/modules/cluster/mig-cos"

  instance_groups = var.instance_groups
  project_id      = var.project_id
  region          = var.region
  resource_prefix = var.resource_prefix
}
@@ -0,0 +1,72 @@
<!-- BEGINNING OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
Copyright 2022 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

## Requirements

No requirements.

## Providers

| Name | Version |
|------|---------|
| <a name="provider_google"></a> [google](#provider\_google) | n/a |
| <a name="provider_google-beta"></a> [google-beta](#provider\_google-beta) | n/a |

## Modules

| Name | Source | Version |
|------|--------|---------|
| <a name="module_dashboard"></a> [dashboard](#module\_dashboard) | ../../common/dashboard | n/a |
| <a name="module_kubectl-apply"></a> [kubectl-apply](#module\_kubectl-apply) | ./kubectl-apply | n/a |
| <a name="module_network"></a> [network](#module\_network) | ../../common/network | n/a |
| <a name="module_resource_policy"></a> [resource\_policy](#module\_resource\_policy) | ../../common/resource_policy | n/a |

## Resources

| Name | Type |
|------|------|
| [google-beta_google_container_cluster.cluster](https://registry.terraform.io/providers/hashicorp/google-beta/latest/docs/resources/google_container_cluster) | resource |
| [google-beta_google_container_node_pool.node-pools](https://registry.terraform.io/providers/hashicorp/google-beta/latest/docs/resources/google_container_node_pool) | resource |
| [google_project_iam_member.node_service_account_logWriter](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/project_iam_member) | resource |
| [google_project_iam_member.node_service_account_metricWriter](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/project_iam_member) | resource |
| [google_project_iam_member.node_service_account_monitoringViewer](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/project_iam_member) | resource |
| [google_client_config.current](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/client_config) | data source |
| [google_compute_default_service_account.account](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/compute_default_service_account) | data source |
| [google_container_engine_versions.gkeversion](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/container_engine_versions) | data source |

## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_disk_size_gb"></a> [disk\_size\_gb](#input\_disk\_size\_gb) | Size of the disk attached to each node, specified in GB. The smallest allowed disk size is 10GB. Defaults to 200GB.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster#disk_size_gb), [gcloud](https://cloud.google.com/sdk/gcloud/reference/container/clusters/create#--disk-size). | `number` | `200` | no |
| <a name="input_disk_type"></a> [disk\_type](#input\_disk\_type) | Type of the disk attached to each node. The default disk type is 'pd-ssd'.<br><br>Possible values: `["pd-ssd", "local-ssd", "pd-balanced", "pd-standard"]`<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster#disk_type), [gcloud](https://cloud.google.com/sdk/gcloud/reference/container/clusters/create#--disk-type). | `string` | `"pd-ssd"` | no |
| <a name="input_enable_gke_dashboard"></a> [enable\_gke\_dashboard](#input\_enable\_gke\_dashboard) | Flag to enable GPU usage dashboards for the GKE cluster. | `bool` | `true` | no |
| <a name="input_gke_version"></a> [gke\_version](#input\_gke\_version) | The GKE version to be used as the minimum version of the master. The default value for that is latest master version.<br>More details can be found [here](https://cloud.google.com/kubernetes-engine/versioning#specifying_cluster_version)<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster#name), [gcloud](https://cloud.google.com/sdk/gcloud/reference/container/clusters/create#--name). | `string` | `null` | no |
| <a name="input_host_maintenance_interval"></a> [host\_maintenance\_interval](#input\_host\_maintenance\_interval) | Specifies the frequency of planned maintenance events. 'PERIODIC' is the only supported value for host\_maintenance\_interval. This enables using stable fleet VMs. | `string` | `"PERIODIC"` | no |
| <a name="input_ksa"></a> [ksa](#input\_ksa) | The configuration for setting up Kubernetes Service Account (KSA) after GKE<br>cluster is created. Disable by setting to null.<br><br>- `name`: The KSA name to be used for Pods<br>- `namespace`: The KSA namespace to be used for Pods<br><br>Related Docs: [Workload Identity](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity) | <pre>object({<br> name = string<br> namespace = string<br> })</pre> | <pre>{<br> "name": "aiinfra-gke-sa",<br> "namespace": "default"<br>}</pre> | no |
| <a name="input_network_existing"></a> [network\_existing](#input\_network\_existing) | Existing network to attach to nic0. Setting to null will create a new network for it. | <pre>object({<br> network_name = string<br> subnetwork_name = string<br> })</pre> | `null` | no |
| <a name="input_node_pools"></a> [node\_pools](#input\_node\_pools) | The list of node pools for the GKE cluster.<br>- `zone`: The zone in which the node pool's nodes should be located. Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_node_pool.html#node_locations)<br>- `node_count`: The number of nodes per node pool. This field can be used to update the number of nodes per node pool. Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_node_pool.html#node_count)<br>- `machine_type`: (Optional) The machine type for the node pool. Only supported machine types are 'a3-highgpu-8g' and 'a2-highgpu-1g'. [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster#machine_type)<br>- `compact_placement_policy`:(Optional) The object for superblock level compact placement policy for the instances. Currently only 1 resource policy is supported. Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_node_pool.html#policy_name)<br> - `new_policy`: (Optional) Flag for creating a new resource policy.<br> - `existing_policy_name`: (Optional) The existing resource policy. | <pre>list(object({<br> zone = string,<br> node_count = number,<br> machine_type = optional(string, "a3-highgpu-8g"),<br> compact_placement_policy = optional(object({<br> new_policy = optional(bool, false)<br> existing_policy_name = optional(string)<br> specific_reservation = optional(string)<br> }))<br> }))</pre> | `[]` | no |
| <a name="input_node_service_account"></a> [node\_service\_account](#input\_node\_service\_account) | The service account to be used by the Node VMs. If not specified, the "default" service account is used.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster#nested_node_config), [gcloud](https://cloud.google.com/sdk/gcloud/reference/container/clusters/create#--service-account). | `string` | `null` | no |
| <a name="input_project_id"></a> [project\_id](#input\_project\_id) | GCP Project ID to which the cluster will be deployed. | `string` | n/a | yes |
| <a name="input_region"></a> [region](#input\_region) | The region in which the cluster master will be created. The cluster will be a regional cluster with multiple masters spread across zones in the region, and with default node locations in those zones as well. | `string` | n/a | yes |
| <a name="input_resource_prefix"></a> [resource\_prefix](#input\_resource\_prefix) | Arbitrary string with which all names of newly created resources will be prefixed. | `string` | n/a | yes |

## Outputs

| Name | Description |
|------|-------------|
| <a name="output_id"></a> [id](#output\_id) | Google Kubernetes cluster id |
| <a name="output_name"></a> [name](#output\_name) | Google Kubernetes cluster name |
<!-- END OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
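
The `disk_size_gb` and `disk_type` inputs documented above carry constraints (a 10 GB minimum and a fixed list of disk types) that can be checked before any Terraform run. The function below is a minimal sketch of such a pre-flight check; `validate_disk` is a hypothetical name, and the module itself may or may not enforce these rules.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check mirroring the documented constraints:
# disk_size_gb must be a whole number of at least 10, and disk_type must
# be one of the listed values.
validate_disk() {
  local size_gb="$1" disk_type="$2"
  case "$size_gb" in
    '' | *[!0-9]*)
      echo "disk_size_gb must be a whole number, got '${size_gb}'" >&2
      return 1
      ;;
  esac
  if [ "$size_gb" -lt 10 ]; then
    echo "disk_size_gb must be at least 10, got ${size_gb}" >&2
    return 1
  fi
  case "$disk_type" in
    pd-ssd | local-ssd | pd-balanced | pd-standard) ;;
    *)
      echo "unsupported disk_type '${disk_type}'" >&2
      return 1
      ;;
  esac
}

# The documented defaults (200 GB, pd-ssd) pass the check.
validate_disk 200 pd-ssd && echo "defaults ok"
```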