Vertex Pipelines MLOps framework blueprint #1038

Merged
merged 48 commits into from
Feb 2, 2023
Changes from 39 commits
7c46512
First MLOps automated environment version
javiergp Sep 6, 2022
bafee4f
Add MLOPS.md
javiergp Nov 3, 2022
2fcd5b4
Updated branch to master
javiergp Nov 19, 2022
e43105f
Projects reorg
javiergp Nov 22, 2022
dc39bd2
Fix cloudbuild bucket
javiergp Nov 22, 2022
be32d72
Fixed projects yaml files
javiergp Dec 3, 2022
00ecc2e
Fixed staging env
javiergp Dec 4, 2022
65e5de4
Added tests
javiergp Dec 5, 2022
5c66de5
Removed CICD project
javiergp Dec 5, 2022
1a258ae
Removed CICD project
javiergp Dec 5, 2022
377448a
Improved doc
javiergp Dec 5, 2022
ce570de
Fixed sample files
javiergp Dec 6, 2022
4a22bc9
Fixed licenses and tests
javiergp Dec 7, 2022
5cc1e1e
Fixed linting
javiergp Dec 7, 2022
1313d12
Merge branch 'master' into jgpuga/mlops
javiergp Dec 7, 2022
ea4b598
Fixed linting
javiergp Dec 7, 2022
7b2e761
Simplified blueprint
javiergp Jan 17, 2023
176b50b
Fixed linting
javiergp Jan 17, 2023
db9d011
Fixed tests
javiergp Jan 17, 2023
38b907a
Merge branch 'master' into jgpuga/mlops
javiergp Jan 17, 2023
a0d2302
Fixed PR comments
javiergp Jan 18, 2023
f8134cf
Merge branch 'jgpuga/mlops' of https://github.com/GoogleCloudPlatform…
javiergp Jan 19, 2023
8510b38
Fixed PR comments
javiergp Jan 19, 2023
f41a979
Merge branch 'master' into jgpuga/mlops
javiergp Jan 19, 2023
3f269a1
Fixed linting
javiergp Jan 19, 2023
25d2fa4
Fixed linting
javiergp Jan 19, 2023
29f6e17
Fixed linting
javiergp Jan 19, 2023
f33456a
Fixed PR comments
javiergp Jan 22, 2023
a7167fd
Merge branch 'master' into jgpuga/mlops
javiergp Jan 22, 2023
1a0b40c
Improved README.md
javiergp Jan 23, 2023
ed8c6dd
Merge branch 'master' into jgpuga/mlops
javiergp Jan 23, 2023
937ffbd
Add CMEK support
javiergp Jan 24, 2023
a7ec1b2
Merge branch 'master' into jgpuga/mlops
javiergp Jan 24, 2023
eb3eb85
Merge branch 'jgpuga/mlops' of https://github.com/GoogleCloudPlatform…
javiergp Jan 24, 2023
e0317db
Improved README.md
javiergp Jan 25, 2023
9b95f9b
Merge branch 'master' into jgpuga/mlops
javiergp Jan 25, 2023
8e5f1d8
Fixed bucket name
javiergp Jan 25, 2023
0650e78
Fixed null values in variables
javiergp Jan 26, 2023
0aa792f
Merge branch 'master' into jgpuga/mlops
javiergp Jan 26, 2023
bc82695
Merge branch 'master' into jgpuga/mlops
javiergp Jan 30, 2023
d80bc5f
Added opinionated groups
javiergp Jan 31, 2023
e9f1880
Merge branch 'master' into jgpuga/mlops
javiergp Jan 31, 2023
cfe2eb7
Merge branch 'master' into jgpuga/mlops
javiergp Jan 31, 2023
4bf6fac
Merge branch 'master' into jgpuga/mlops
ludoo Feb 2, 2023
45568a8
Merge branch 'master' into jgpuga/mlops
javiergp Feb 2, 2023
37915ce
Linking MLOPs blueprint to top level README files
javiergp Feb 2, 2023
593cea0
Merge branch 'jgpuga/mlops' of https://github.com/GoogleCloudPlatform…
javiergp Feb 2, 2023
d28a5bb
Linking MLOPs blueprint to top level README files
javiergp Feb 2, 2023
76 changes: 76 additions & 0 deletions blueprints/data-solutions/vertex-mlops/README.md
@@ -0,0 +1,76 @@
# MLOps with Vertex AI - Infra setup

## Introduction
This example implements the infrastructure required to deploy an end-to-end [MLOps process](https://services.google.com/fh/files/misc/practitioners_guide_to_mlops_whitepaper.pdf) using the [Vertex AI](https://cloud.google.com/vertex-ai) platform.

## GCP resources
The blueprint deploys all the resources required for a fully functional MLOps environment, containing:

- GCP Project to host all the resources
- Isolated VPC network and a subnet to be used by Vertex and Dataflow (using a Shared VPC is also possible).
- Firewall rule to allow the internal subnet communication required by Dataflow
- Cloud NAT required to reach the internet from the different computing resources (Vertex and Dataflow)
- GCS buckets to host Vertex AI and Cloud Build artifacts. By default the buckets are regional and should match the Vertex AI region for the different resources (e.g. Vertex Managed Dataset) and processes (e.g. Vertex training).
- BigQuery Dataset where the training data will be stored. This is optional, since the training data could be already hosted in an existing BigQuery dataset.
- Service account (`mlops-[env]@`) with the minimum permissions required by Vertex and Dataflow
- Service account (`github@`) to be used by Workload Identity Federation, to federate the GitHub identity (optional).
- Secret to store the GitHub SSH key used to access the CI/CD code repository.

![MLOps project description](./images/mlops_projects.png "MLOps project description")

## Pre-requirements

### User groups

Assigning roles to user groups decouples the final set of permissions from the stage where entities and resources are created and their IAM bindings defined. These groups should be created before launching Terraform.

We use the following groups to control access to resources:

- *Data Scientists* (gcp-ml-ds@<company.org>). They create ML pipelines in the experimentation environment.
- *ML Engineers* (gcp-ml-eng@<company.org>). They manage and run the different environments, with access to all resources in order to troubleshoot possible issues with pipelines.

These groups are not suitable for production-grade environments. You can configure the group names through the `groups` variable.
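
As a rough sketch of how roles might be bound to such groups, the `group_iam` variable listed in the variables table takes a map from group email to a list of roles. The `example.com` domain and the roles chosen here are illustrative assumptions, not recommendations taken from the blueprint:

```hcl
# terraform.tfvars (fragment): group emails and roles are hypothetical examples
group_iam = {
  # Data Scientists: read-only access to Vertex AI and BigQuery data
  "gcp-ml-ds@example.com" = [
    "roles/aiplatform.viewer",
    "roles/bigquery.dataViewer",
  ]
  # ML Engineers: project-wide read access for troubleshooting
  "gcp-ml-eng@example.com" = [
    "roles/viewer",
  ]
}
```

The variables table describes these bindings as authoritative, so it is worth checking that the roles listed here do not conflict with bindings managed elsewhere for the same project.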

## Instructions
### Deploy the experimentation environment

- Create a `terraform.tfvars` file and specify the variables to match your desired configuration. You can use the provided `terraform.tfvars.sample` as reference.
- Make sure you have the right authentication setup (application default credentials, or a service account key)
- Run `terraform init` and `terraform apply`
- Errors such as `googleapi: Error 400: Service account xxxx does not exist.` may appear. They are caused by dependencies on the authoritative project IAM bindings of the service accounts. If this happens, run `terraform apply` again.
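
A minimal `terraform.tfvars` for the first step could look like the sketch below. It only uses variables listed in the variables table; every value (project id, prefix, billing account, folder, bucket and dataset names) is a placeholder to be replaced, and the provided `terraform.tfvars.sample` remains the authoritative reference:

```hcl
# terraform.tfvars: all values below are placeholders
project_id = "my-mlops-project"
prefix     = "myco"
region     = "europe-west4"

# Create a new project; leave unset (null) to reuse an existing one
project_create = {
  billing_account_id = "000000-000000-000000"
  parent             = "folders/1234567890"
}

# Optional resources: both are skipped when left at their null default
bucket_name  = "vertex-artifacts"
dataset_name = "training_data"
```

With `project_create` left unset, `project_id` is expected to reference a project that already exists.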

## What's next?

Once the environment is deployed, you can follow this [guide](https://github.com/javiergp/professional-services/blob/main/examples/vertex_mlops_enterprise/README.md) to set up the Vertex AI pipeline and run it on the deployed infrastructure.
<!-- BEGIN TFDOC -->

## Variables

| name | description | type | required | default |
|---|---|:---:|:---:|:---:|
| [project_id](variables.tf#L93) | Project id, references existing project if `project_create` is null. | <code>string</code> | ✓ | |
| [bucket_name](variables.tf#L18) | GCS bucket name to store the Vertex AI artifacts. | <code>string</code> | | <code>null</code> |
| [dataset_name](variables.tf#L24) | BigQuery Dataset to store the training data. | <code>string</code> | | <code>null</code> |
| [group_iam](variables.tf#L31) | Authoritative IAM binding for the project, in {GROUP_EMAIL => [ROLES]} format. | <code>map&#40;list&#40;string&#41;&#41;</code> | | <code>&#123;&#125;</code> |
| [identity_pool_claims](variables.tf#L38) | Claims to be used by Workload Identity Federation (e.g. attribute.repository/ORGANIZATION/REPO). If a non-null value is provided, the google_iam_workload_identity_pool resource will be created. | <code>string</code> | | <code>null</code> |
| [labels](variables.tf#L44) | Labels to be assigned at project level. | <code>map&#40;string&#41;</code> | | <code>&#123;&#125;</code> |
| [location](variables.tf#L50) | Location used for multi-regional resources. | <code>string</code> | | <code>&#34;eu&#34;</code> |
| [network_config](variables.tf#L56) | Shared VPC network configurations to use. If null networks will be created in projects with preconfigured values. | <code title="object&#40;&#123;&#10; host_project &#61; string&#10; network_self_link &#61; string&#10; subnet_self_link &#61; string&#10;&#125;&#41;">object&#40;&#123;&#8230;&#125;&#41;</code> | | <code>null</code> |
| [notebooks](variables.tf#L66) | Vertex AI workbenches to be deployed. | <code title="map&#40;object&#40;&#123;&#10; owner &#61; string&#10; region &#61; string&#10; subnet &#61; string&#10; internal_ip_only &#61; optional&#40;bool, false&#41;&#10; idle_shutdown &#61; optional&#40;bool&#41;&#10;&#125;&#41;&#41;">map&#40;object&#40;&#123;&#8230;&#125;&#41;&#41;</code> | | <code>&#123;&#125;</code> |
| [prefix](variables.tf#L78) | Prefix used for the project id. | <code>string</code> | | <code>null</code> |
| [project_create](variables.tf#L84) | Provide values if project creation is needed, uses existing project if null. Parent is in 'folders/nnn' or 'organizations/nnn' format. | <code title="object&#40;&#123;&#10; billing_account_id &#61; string&#10; parent &#61; string&#10;&#125;&#41;">object&#40;&#123;&#8230;&#125;&#41;</code> | | <code>null</code> |
| [project_services](variables.tf#L98) | List of core services enabled on all projects. | <code>list&#40;string&#41;</code> | | <code title="&#91;&#10; &#34;aiplatform.googleapis.com&#34;,&#10; &#34;artifactregistry.googleapis.com&#34;,&#10; &#34;bigquery.googleapis.com&#34;,&#10; &#34;cloudbuild.googleapis.com&#34;,&#10; &#34;compute.googleapis.com&#34;,&#10; &#34;datacatalog.googleapis.com&#34;,&#10; &#34;dataflow.googleapis.com&#34;,&#10; &#34;iam.googleapis.com&#34;,&#10; &#34;monitoring.googleapis.com&#34;,&#10; &#34;notebooks.googleapis.com&#34;,&#10; &#34;secretmanager.googleapis.com&#34;,&#10; &#34;servicenetworking.googleapis.com&#34;,&#10; &#34;serviceusage.googleapis.com&#34;&#10;&#93;">&#91;&#8230;&#93;</code> |
| [region](variables.tf#L118) | Region used for regional resources. | <code>string</code> | | <code>&#34;europe-west4&#34;</code> |
| [repo_name](variables.tf#L124) | Cloud Source Repository name. Set to null to skip creating it. | <code>string</code> | | <code>null</code> |
| [sa_mlops_name](variables.tf#L130) | Name for the MLOps Service Account. | <code>string</code> | | <code>&#34;sa-mlops&#34;</code> |

## Outputs

| name | description | sensitive |
|---|---|:---:|
| [github](outputs.tf#L33) | GitHub configuration. | |
| [notebook](outputs.tf#L39) | Vertex AI managed notebook details. | |
| [project](outputs.tf#L44) | The project resource as returned by the `project` module. | |
| [project_id](outputs.tf#L49) | Project ID. | |

<!-- END TFDOC -->
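
The object-typed variables in the table above can be harder to fill in than the plain strings, so here is a hedged sketch of `network_config`, `notebooks` and `identity_pool_claims`. The attribute names come from the types shown in the table; all project, network, user and repository names are hypothetical. The table does not state whether `subnet` expects a name or a self link, nor the exact format of `owner`, so a self link and a user email are assumed here:

```hcl
# terraform.tfvars (fragment): every identifier below is a placeholder

# Use an existing Shared VPC instead of letting the blueprint create a network
network_config = {
  host_project      = "my-host-project"
  network_self_link = "https://www.googleapis.com/compute/v1/projects/my-host-project/global/networks/shared-vpc"
  subnet_self_link  = "https://www.googleapis.com/compute/v1/projects/my-host-project/regions/europe-west4/subnetworks/mlops"
}

# One Vertex AI Workbench instance, keyed by an arbitrary name
notebooks = {
  "ds-notebook" = {
    owner            = "user@example.com" # assumed to be a user email
    region           = "europe-west4"
    subnet           = "https://www.googleapis.com/compute/v1/projects/my-host-project/regions/europe-west4/subnetworks/mlops" # assumed self link
    internal_ip_only = true
    # idle_shutdown is optional and omitted here
  }
}

# Restrict Workload Identity Federation to a single GitHub repository
identity_pool_claims = "attribute.repository/my-org/my-repo"
```

Leaving `network_config` at its `null` default makes the blueprint create its own VPC, subnet, firewall rule and Cloud NAT, as described in the GCP resources section.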
74 changes: 74 additions & 0 deletions blueprints/data-solutions/vertex-mlops/ci-cd.tf
@@ -0,0 +1,74 @@
/**
* Copyright 2022 Google LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

resource "google_iam_workload_identity_pool" "github_pool" {
count = var.identity_pool_claims == null ? 0 : 1
project = module.project.project_id
workload_identity_pool_id = "gh-pool"
display_name = "Github Actions Identity Pool"
description = "Identity pool for Github Actions"
}

resource "google_iam_workload_identity_pool_provider" "github_provider" {
count = var.identity_pool_claims == null ? 0 : 1
project = module.project.project_id
workload_identity_pool_id = google_iam_workload_identity_pool.github_pool[0].workload_identity_pool_id
workload_identity_pool_provider_id = "gh-provider"
display_name = "Github Actions provider"
description = "OIDC provider for Github Actions"
attribute_mapping = {
"google.subject" = "assertion.sub"
"attribute.repository" = "assertion.repository"
}
oidc {
issuer_uri = "https://token.actions.githubusercontent.com"
}
}

module "artifact_registry" {
source = "../../../modules/artifact-registry"
id = "docker-repo"
project_id = module.project.project_id
location = var.region
format = "DOCKER"
# iam = {
# "roles/artifactregistry.admin" = ["group:[email protected]"]
# }
}

module "service-account-github" {
source = "../../../modules/iam-service-account"
name = "sa-github"
project_id = module.project.project_id
iam = var.identity_pool_claims == null ? {} : { "roles/iam.workloadIdentityUser" = ["principalSet://iam.googleapis.com/${google_iam_workload_identity_pool.github_pool[0].name}/${var.identity_pool_claims}"] }
}

# NOTE: Secret manager module at the moment does not support CMEK
module "secret-manager" {
project_id = module.project.project_id
source = "../../../modules/secret-manager"
secrets = {
github-key = [var.region]
}
iam = {
github-key = {
"roles/secretmanager.secretAccessor" = [
"serviceAccount:${module.project.service_accounts.robots.cloudbuild}",
module.service-account-mlops.iam_email
]
}
}
}
230 changes: 230 additions & 0 deletions blueprints/data-solutions/vertex-mlops/main.tf
@@ -0,0 +1,230 @@
/**
* Copyright 2022 Google LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/


locals {
service_encryption_keys = var.service_encryption_keys
shared_vpc_project = try(var.network_config.host_project, null)

subnet = (
local.use_shared_vpc
? var.network_config.subnet_self_link
: values(module.vpc-local.0.subnet_self_links)[0]
)
vpc = (
local.use_shared_vpc
? var.network_config.network_self_link
: module.vpc-local.0.self_link
)
use_shared_vpc = var.network_config != null

shared_vpc_bindings = {
"roles/compute.networkUser" = [
"robot-df", "notebooks"
]
}

shared_vpc_role_members = {
robot-df = "serviceAccount:${module.project.service_accounts.robots.dataflow}"
notebooks = "serviceAccount:${module.project.service_accounts.robots.notebooks}"
}

# flatten the role => members map into one entry per binding, keyed by
# "<role>-<member>", which is a format suitable for for_each
shared_vpc_bindings_map = {
for binding in flatten([
for role, members in local.shared_vpc_bindings : [
for member in members : { role = role, member = member }
]
]) : "${binding.role}-${binding.member}" => binding
}
}

module "gcs-bucket" {
count = var.bucket_name == null ? 0 : 1
source = "../../../modules/gcs"
project_id = module.project.project_id
name = var.bucket_name
prefix = var.prefix
location = var.region
storage_class = "REGIONAL"
versioning = false
encryption_key = try(local.service_encryption_keys.storage, null) # Example assignment of an encryption key
}

# Default bucket for Cloud Build to prevent error: "'us' violates constraint 'constraints/gcp.resourceLocations'"
# https://stackoverflow.com/questions/53206667/cloud-build-fails-with-resource-location-constraint
module "gcs-bucket-cloudbuild" {
source = "../../../modules/gcs"
project_id = module.project.project_id
name = "${var.project_id}_cloudbuild"
prefix = var.prefix
location = var.region
storage_class = "REGIONAL"
versioning = false
encryption_key = try(local.service_encryption_keys.storage, null) # Example assignment of an encryption key
}

module "bq-dataset" {
count = var.dataset_name == null ? 0 : 1
source = "../../../modules/bigquery-dataset"
project_id = module.project.project_id
id = var.dataset_name
location = var.region
encryption_key = try(local.service_encryption_keys.bq, null) # Example assignment of an encryption key
}

module "vpc-local" {
count = local.use_shared_vpc ? 0 : 1
source = "../../../modules/net-vpc"
project_id = module.project.project_id
name = "default"
subnets = [
{
"name" : "default",
"region" : var.region,
"ip_cidr_range" : "10.4.0.0/24",
"secondary_ip_range" : null
}
]
psa_config = {
ranges = {
"vertex" : "10.13.0.0/18"
}
routes = null
}
}

module "firewall" {
count = local.use_shared_vpc ? 0 : 1
source = "../../../modules/net-vpc-firewall"
project_id = module.project.project_id
network = module.vpc-local[0].name
default_rules_config = {
disabled = true
}
ingress_rules = {
dataflow-ingress = {
description = "Dataflow service."
direction = "INGRESS"
action = "allow"
sources = ["dataflow"]
targets = ["dataflow"]
ranges = []
use_service_accounts = false
rules = [{ protocol = "tcp", ports = ["12345-12346"] }]
extra_attributes = {}
}
}

}

module "cloudnat" {
count = local.use_shared_vpc ? 0 : 1
source = "../../../modules/net-cloudnat"
project_id = module.project.project_id
region = var.region
name = "default"
router_network = module.vpc-local[0].self_link
}

module "project" {
source = "../../../modules/project"
name = var.project_id
parent = try(var.project_create.parent, null)
billing_account = try(var.project_create.billing_account_id, null)
project_create = var.project_create != null
prefix = var.prefix
group_iam = var.group_iam
iam = {
"roles/aiplatform.user" = [module.service-account-mlops.iam_email]
"roles/artifactregistry.reader" = [module.service-account-mlops.iam_email]
"roles/artifactregistry.writer" = [module.service-account-github.iam_email]
"roles/bigquery.dataEditor" = [module.service-account-mlops.iam_email]
"roles/bigquery.jobUser" = [module.service-account-mlops.iam_email]
"roles/bigquery.user" = [module.service-account-mlops.iam_email]
"roles/cloudbuild.builds.editor" = [
module.service-account-mlops.iam_email,
module.service-account-github.iam_email
]

"roles/cloudfunctions.invoker" = [module.service-account-mlops.iam_email]
"roles/dataflow.developer" = [module.service-account-mlops.iam_email]
"roles/dataflow.worker" = [module.service-account-mlops.iam_email]
"roles/iam.serviceAccountUser" = [
module.service-account-mlops.iam_email,
"serviceAccount:${module.project.service_accounts.robots.cloudbuild}"
]
"roles/monitoring.metricWriter" = [module.service-account-mlops.iam_email]
"roles/run.invoker" = [module.service-account-mlops.iam_email]
"roles/serviceusage.serviceUsageConsumer" = [
module.service-account-mlops.iam_email,
module.service-account-github.iam_email
]
"roles/storage.admin" = [
module.service-account-mlops.iam_email,
module.service-account-github.iam_email
]
}
labels = var.labels

org_policies = {
# "constraints/compute.requireOsLogin" = {
# enforce = false
# }
# Example of applying a project wide policy, mainly useful for Composer 1
}

service_encryption_key_ids = {
bq = [try(local.service_encryption_keys.bq, null)]
compute = [try(local.service_encryption_keys.compute, null)]
cloudbuild = [try(local.service_encryption_keys.storage, null)]
notebooks = [try(local.service_encryption_keys.compute, null)]
storage = [try(local.service_encryption_keys.storage, null)]
}
services = var.project_services


shared_vpc_service_config = local.shared_vpc_project == null ? null : {
attach = true
host_project = local.shared_vpc_project
}

}

module "service-account-mlops" {
source = "../../../modules/iam-service-account"
name = var.sa_mlops_name
project_id = module.project.project_id
iam = {
"roles/iam.serviceAccountUser" = [module.service-account-github.iam_email]
}
}

resource "google_project_iam_member" "shared_vpc" {
count = local.use_shared_vpc ? 1 : 0
project = var.network_config.host_project
role = "roles/compute.networkUser"
member = "serviceAccount:${module.project.service_accounts.robots.notebooks}"
}


resource "google_sourcerepo_repository" "code-repo" {
count = var.repo_name == null ? 0 : 1
name = var.repo_name
project = module.project.project_id
}

