This repository contains demo data pipelines orchestrated by Google Cloud Workflows.
Pipelines:
- Word Count (event-driven processing)
  - A text file arrives in a GCS bucket
  - An object notification is sent to Pub/Sub
  - A Pub/Sub push subscription triggers a Cloud Function
  - The Cloud Function triggers a Cloud Workflow with the necessary arguments
  - The Cloud Workflow encapsulates all steps needed to process the file and executes them:
    - Submits a serverless Spark job to Dataproc. The job counts the words and stores the per-file results in a BigQuery table
    - Polls the job status and waits to report success or failure
    - If the Spark job succeeds, triggers a SQL step in BigQuery to reconstruct an aggregate table
All resources are defined and deployed by a Terraform module.
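For testing or debugging, the workflow can also be started by hand, mimicking what the Cloud Function does. The workflow name (wordcount) and the argument keys (bucket, name) below are illustrative assumptions only; check the Terraform module and the Cloud Function source for the actual names.

# Hypothetical manual trigger: workflow name and argument keys are assumptions
gcloud workflows execute wordcount \
  --location=<COMPUTE_REGION> \
  --data='{"bucket": "<project>-data-wordcount", "name": "landing/rose.txt"}'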
Set the following environment variables:

export PROJECT_ID=
export COMPUTE_REGION=
export DATA_REGION=
export ACCOUNT=
export BUCKET=${PROJECT_ID}-terraform
export TF_SA=terraform
Create (or activate) a gcloud configuration for that project:
export CONFIG=workflows-sandbox
gcloud config configurations create $CONFIG
gcloud config set project $PROJECT_ID
gcloud config set account $ACCOUNT
gcloud config set compute/region $COMPUTE_REGION
Authenticate gcloud:
gcloud auth login --project $PROJECT_ID
gcloud auth application-default login --project $PROJECT_ID
Create a GCS bucket for the Terraform state:

gsutil mb -p $PROJECT_ID -l $COMPUTE_REGION -b on gs://$BUCKET
Terraform needs to run with a service account to deploy some of the resources; a user account alone is not sufficient. Prepare the Terraform service account and enable the required GCP APIs:
./scripts/prepare_terraform_service_account.sh
./scripts/enable_gcp_apis.sh
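For reference, below is a rough sketch of what these two scripts are expected to do. The exact roles and API list are assumptions; consult the scripts themselves for the authoritative version.

# Assumed sketch only: create the Terraform service account, allow the current
# user to impersonate it, and enable the core APIs used by the pipeline.
gcloud iam service-accounts create $TF_SA --project=$PROJECT_ID
gcloud iam service-accounts add-iam-policy-binding $TF_SA@$PROJECT_ID.iam.gserviceaccount.com \
  --member="user:$ACCOUNT" \
  --role="roles/iam.serviceAccountTokenCreator"
gcloud services enable workflows.googleapis.com cloudfunctions.googleapis.com \
  dataproc.googleapis.com bigquery.googleapis.com pubsub.googleapis.com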
The solution is deployed by Terraform, so all configuration is done on the Terraform side.
Create a new .tfvars file and override the variables described in the sections below. You can use the example tfvars files (example-variables) as a base.
export VARS=variables.tfvars
Most required variables have default values defined in variables.tf. You can use the defaults or override them in the newly created .tfvars file.
Either way, the following variables must be set:
project = "<GCP project ID to deploy the solution to (equals $PROJECT_ID)>"
compute_region = "<GCP region to deploy compute resources, e.g. Cloud Run, etc. (equals $COMPUTE_REGION)>"
data_region = "<GCP region to deploy data resources (buckets, datasets, etc.) (equals $DATA_REGION)>"
As noted above, Terraform must run with a service account; a user account alone is not sufficient.
This service account name is defined in the "Setup Environment Variables" step and created in the "Prepare Terraform Service Account" step. Use the full email of the created account.
terraform_service_account = "${TF_SA}@${PROJECT_ID}.iam.gserviceaccount.com"
Deploy the Terraform module:
./scripts/deploy_terraform.sh
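deploy_terraform.sh wraps the standard Terraform workflow. Below is a minimal sketch of the equivalent manual commands, assuming the module keeps its state in the $BUCKET bucket created earlier and reads the variable file from $VARS; the actual script may differ.

# Assumed equivalent of the deploy script
terraform init -backend-config="bucket=${BUCKET}"
terraform plan -var-file="${VARS}"
terraform apply -var-file="${VARS}"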
- This repo uses the default VPC of the target project
- To use Dataproc with private IPs, Private Google Access must be enabled on the subnetwork. This is done via:
gcloud compute networks subnets update <SUBNETWORK> \
--region=<REGION> \
--enable-private-ip-google-access
- On a customer project, a Shared VPC is expected and the subnetwork must have Private Google Access enabled (a verification sketch is shown below)
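To verify that Private Google Access is enabled on a subnetwork, a check along these lines can be used:

# Prints "True" when Private Google Access is enabled on the subnetwork
gcloud compute networks subnets describe <SUBNETWORK> \
  --region=<REGION> \
  --format="get(privateIpGoogleAccess)"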
To execute the wordcount pipeline, upload a text file to the data bucket created by Terraform:
gsutil cp gs://pub/shakespeare/rose.txt gs://<project>-data-wordcount/landing/rose.txt
The wordcount Cloud Workflow will populate the BigQuery tables sandbox.word_count_output and sandbox.word_count_aggregate.
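Once the workflow completes, the results can also be inspected from the CLI, for example:

# Peek at the aggregate table (column names depend on the Spark job's output schema)
bq query --nouse_legacy_sql 'SELECT * FROM sandbox.word_count_aggregate LIMIT 10'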
Run the below query to track the progress of file processing steps across the different GCP components:
SELECT * FROM monitoring.v_global_tracker
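The same view can also be queried from the command line:

bq query --nouse_legacy_sql 'SELECT * FROM monitoring.v_global_tracker'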