This document describes the full process from making a change to your pipeline code, all the way through to having the code running in production. It assumes that the following pre-requisites are already in place:
- Suitable GCP environments set up - see the README section on Deploying Cloud Infrastructure
- This repo forked / used as a template for a new GitHub repo
- CI/CD set up - see the instructions here
- Access set up for the BigQuery datasets used in the example pipelines
- Git repo cloned locally (or in a notebook environment) and local setup complete - see here
To make changes to the pipelines, follow this workflow:
- Create a feature branch off the main/master branch: `git checkout -b my-feature-branch`
- Make changes to your pipeline code locally (see main README and USAGE.md for more details)
- Commit these changes to your feature branch
- Push your feature branch to GitHub
- Open a Pull Request (PR) from your feature branch to the main/master branch
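Taken together, a minimal sketch of this flow with plain git looks like the following (the branch name and commit message are just placeholders):

```bash
# Start from an up-to-date main/master branch
git checkout main
git pull

# Create a feature branch and make your pipeline changes
git checkout -b my-feature-branch
# ... edit pipeline code, unit tests and E2E tests ...

# Commit and push the branch to GitHub, then open a PR from the GitHub UI
git add .
git commit -m "Update training pipeline preprocessing step"
git push -u origin my-feature-branch
```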
When you open the Pull Request, the CI pipeline (`pr-checks.yaml`) should be triggered to run pre-commit checks, unit tests, and compile the training and prediction pipelines. Your E2E checks can also be triggered on the Pull Request (using the `/gcbrun` comment command). Once you are happy with your Pull Request (usual peer review etc.), merge it into the main/master branch.
| 💡 Remember |
|---|
| Make sure to update any unit tests and end-to-end tests in line with your changes to the pipelines |
To compile and publish your ML pipelines into your test and prod environments, you will need to create a release.
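One way to create the release is to push a git tag from the main/master branch; a sketch of that is shown below (you can equally create the release through the GitHub Releases UI, and the tag name `v1.2` is simply the one used in the example that follows):

```bash
# Make sure main/master is up to date, then tag it and push the tag to GitHub
git checkout main
git pull
git tag -a v1.2 -m "Release v1.2"
git push origin v1.2
```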
When the new tag is created, the `release.yaml` pipeline should be triggered. It will compile both the training and prediction pipelines, then copy the ML pipelines and their assets to the Cloud Storage locations specified in your CI/CD setup, under a folder with the name of your git tag.
As an example, assume that:

- You are using the `xgboost` template
- You create a release from the main/master branch and use the git tag `v1.2`
- You have set up the following Cloud Build variables / substitutions for the `release.yaml` CI/CD pipeline:
  - `_PIPELINE_PUBLISH_GCS_PATHS` = `gs://<Project ID of dev project>-pl-assets gs://<Project ID of test project>-pl-assets gs://<Project ID of prod project>-pl-assets`
Assuming your end-to-end tests pass, your compiled training pipeline will be published to:
gs://<Project ID of dev project>-pl-assets/v1.2/training/training.json
gs://<Project ID of test project>-pl-assets/v1.2/training/training.json
gs://<Project ID of prod project>-pl-assets/v1.2/training/training.json
The contents of your assets folder for your training pipeline will be published to:
gs://<Project ID of dev project>-pl-assets/v1.2/training/assets/
gs://<Project ID of test project>-pl-assets/v1.2/training/assets/
gs://<Project ID of prod project>-pl-assets/v1.2/training/assets/
Similarly, your compiled prediction pipeline will be published to:
gs://<Project ID of dev project>-pl-assets/v1.2/prediction/prediction.json
gs://<Project ID of test project>-pl-assets/v1.2/prediction/prediction.json
gs://<Project ID of prod project>-pl-assets/v1.2/prediction/prediction.json
The contents of your assets folder for your prediction pipeline will be published to:
gs://<Project ID of dev project>-pl-assets/v1.2/prediction/assets/
gs://<Project ID of test project>-pl-assets/v1.2/prediction/assets/
gs://<Project ID of prod project>-pl-assets/v1.2/prediction/assets/
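If you want to confirm that the release was published as expected, you can list the contents of the tag folder in each bucket. A minimal sketch, assuming the example `v1.2` tag and bucket naming above (substitute your own project IDs):

```bash
# List everything published under the v1.2 tag in the test environment bucket
gsutil ls -r gs://<Project ID of test project>-pl-assets/v1.2/
```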
| ❗ IMPORTANT |
|---|
| Before your E2E tests can run correctly, you need to make sure that the parameters have been set up correctly for the cloud environment in the relevant pipeline definition files (`pipeline.py`). These can inherit from environment variables set in `env.sh` (for triggering ad hoc) or in your Cloud Build trigger setup (for triggering through CI/CD) |
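For ad hoc runs, that typically means exporting the relevant values before compiling and triggering the pipelines. The sketch below is illustrative only: the variable names are hypothetical, and the names your template's `pipeline.py` actually reads may differ, so check the template before copying.

```bash
# env.sh (illustrative sketch - the variable names below are hypothetical;
# use whichever names your template's pipeline.py actually reads)
export VERTEX_PROJECT_ID=my-dev-project
export VERTEX_LOCATION=europe-west2
export PIPELINE_FILES_GCS_PATH=gs://my-dev-project-pl-assets
```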
Now that you have created a release, and the compiled pipelines (and their `assets` files) have been copied to the test and prod environments, you can schedule your pipelines to run in those environments. We will begin by scheduling the pipelines to run in the test environment:
- Create a new branch off the main/master branch, e.g. `git checkout -b test-env-scheduling`
- Copy the file `terraform/modules/scheduled_pipelines/scheduled_jobs.auto.tfvars.example` to `terraform/envs/test/scheduled_jobs.auto.tfvars`
- In this file you will see the variable `cloud_schedulers_config`, a map of all the Cloud Scheduler jobs that we want to deploy. Continuing with the earlier example, the code below shows how to schedule the training pipeline to run on the first of every month, and the prediction pipeline to run every night:
cloud_schedulers_config = {
  # Uncomment and amend as required
  xgboost_training = {
    name           = "xgboost-training-pipeline-trigger"
    description    = "Trigger my XGBoost training pipeline in Vertex"
    schedule       = "0 0 1 * *"
    time_zone      = "UTC"
    template_path  = "gs://<Project ID of test project>-pl-assets/v1.2/training/training.json"
    enable_caching = null
    pipeline_parameters = {
      project_id              = "<Project ID of test project>"
      project_location        = "europe-west2"
      pipeline_files_gcs_path = "gs://<Project ID of test project>-pl-assets/v1.2/training/assets"
      ingestion_project_id    = "<Project ID of test project>"
      model_name              = "xgboost-with-preprocessing"
      model_label             = "label_name"
      tfdv_schema_filename    = "tfdv_schema_training.pbtxt"
      tfdv_train_stats_path   = "gs://<Project ID of test project>-pl-root/train_stats/train.stats"
      dataset_id              = "preprocessing"
      dataset_location        = "europe-west2"
      ingestion_dataset_id    = "chicago_taxi_trips"
      timestamp               = "2022-12-01 00:00:00"
    },
  },
  xgboost_prediction = {
    name           = "xgboost-prediction-pipeline-trigger"
    description    = "Trigger my XGBoost prediction pipeline in Vertex"
    schedule       = "0 0 * * *"
    time_zone      = "UTC"
    template_path  = "gs://<Project ID of test project>-pl-assets/v1.2/prediction/prediction.json"
    enable_caching = null
    pipeline_parameters = {
      project_id                    = "<Project ID of test project>"
      project_location              = "europe-west2"
      pipeline_files_gcs_path       = "gs://<Project ID of test project>-pl-assets/v1.2/prediction/assets"
      ingestion_project_id          = "<Project ID of test project>"
      model_name                    = "xgboost-with-preprocessing"
      model_label                   = "label_name"
      tfdv_schema_filename          = "tfdv_schema_prediction.pbtxt"
      tfdv_train_stats_path         = "gs://<Project ID of test project>-pl-root/train_stats/train.stats"
      dataset_id                    = "preprocessing"
      dataset_location              = "europe-west2"
      ingestion_dataset_id          = "chicago_taxi_trips"
      timestamp                     = "2022-12-01 00:00:00"
      batch_prediction_machine_type = "n1-standard-4"
      batch_prediction_min_replicas = 3
      batch_prediction_max_replicas = 5
    },
  },
}
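If you want to preview what will change before pushing, you can run a Terraform plan locally against the test environment. This is just a sketch: it assumes you have credentials with permission to read the Terraform state and the target project, and your backend setup may require extra arguments to `init`.

```bash
# Preview the Cloud Scheduler changes for the test environment
terraform -chdir=terraform/envs/test init
terraform -chdir=terraform/envs/test plan
```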
- Commit these changes to your branch, and push the branch to GitHub
- Open a Pull Request from this branch to the main/master branch. As part of the CI checks (in Cloud Build), you should see a Terraform plan that describes the changes you have made to the Terraform config
- Merge the PR to deploy the Cloud Scheduler jobs
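Once the PR has merged and the Terraform apply has completed, you can check that the Cloud Scheduler jobs exist. A minimal sketch, assuming the jobs are deployed to the `europe-west2` region used in the example above:

```bash
# List the Cloud Scheduler jobs in the test project
gcloud scheduler jobs list --project=<Project ID of test project> --location=europe-west2
```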
Once you are happy with how `v1.2` is working in the test environment, you can follow the same process for the prod environment (using `terraform/envs/prod`, swapping out the necessary values for the different environment, e.g. bucket names).
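As a sketch of that promotion step (assuming you keep the same job definitions and only the project IDs and bucket names differ between environments), you could start from the test configuration and adjust it for prod:

```bash
# Start the prod scheduling config from the test one, then edit the
# project IDs / bucket names inside it to point at the prod environment
cp terraform/envs/test/scheduled_jobs.auto.tfvars terraform/envs/prod/scheduled_jobs.auto.tfvars
```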