This project aims to provide a set of tools that let you easily deploy a Snowplow setup on Google Cloud Platform.
After following all the steps you should have:
- A GKE cluster running:
  - Snowplow Scala Stream Collector
  - Beam Enrich
  - BigQuery Loader
- Pub/Sub topics for the collector and enrich streams
- A BigQuery dataset as the final destination of Snowplow events
- A few GCS buckets
NOTE: This project is still a work in progress and some parts may not work yet, but you are welcome to help!
To manage GCP resources you need the gcloud CLI installed. For installation options, check the official documentation.
This project uses Terraform to bootstrap the infrastructure and kubectl to manage the Kubernetes cluster. On macOS you can install both with Homebrew:
brew install terraform
brew install kubectl
For install options on other systems, please check the documentation of those projects.
- Create a GCP project, or use an existing one.

- Run the following commands:

  export PROJECT_ID=project-name-here
  export SERVICE_ACCOUNT_NAME=snowplow
  bash scripts/setup-iam.sh ${PROJECT_ID} ${SERVICE_ACCOUNT_NAME}

  This will create a service account key in the `keys` directory. The service account will have the `roles/editor` role and will be used to create GCP resources. The script will also enable the required services (GKE).

- To bootstrap the infrastructure required for the Snowplow deployment, run:

  export LOCATION=europe-west3
  export GCP_KEY=keys/${SERVICE_ACCOUNT_NAME}.json
  export CLIENT=client-name
  terraform apply -var "gcp_project=${PROJECT_ID}" -var "gcp_location=${LOCATION}" -var "gcp_key_admin=${GCP_KEY}" -var "client=${CLIENT}"

  The `CLIENT` value is a string added to the names of all resources. It is also recommended to use Terraform workspaces, e.g. `terraform workspace new my_snowplow`.
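The repository's `scripts/setup-iam.sh` is the source of truth for this step; as a rough orientation, a script matching the description above could look like the sketch below. The exact gcloud flags, the key file name, and the list of enabled services are assumptions, not taken from the actual script.

```shell
# Hypothetical sketch of what scripts/setup-iam.sh might do; the real
# script in this repository may differ.
setup_iam() {
  local project_id=$1 sa_name=$2
  local sa_email="${sa_name}@${project_id}.iam.gserviceaccount.com"
  # Create the service account
  gcloud iam service-accounts create "$sa_name" --project "$project_id"
  # Grant the (broad) editor role mentioned above
  gcloud projects add-iam-policy-binding "$project_id" \
    --member "serviceAccount:${sa_email}" --role roles/editor
  # Download a JSON key into the keys directory
  mkdir -p keys
  gcloud iam service-accounts keys create "keys/${sa_name}.json" \
    --iam-account "$sa_email"
  # Enable GKE for the project
  gcloud services enable container.googleapis.com --project "$project_id"
}
```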
At this point all the required resources should be up and running; if you wish, you can check this in the GCP console. In the next steps you will deploy the Snowplow components.
Check the Snowplow documentation.
To get access to the newly created Kubernetes cluster, run:
gcloud container clusters get-credentials "snowplow-gke" --region ${LOCATION}
The collector configuration requires you to provide the GCP project ID. You can do this by running the following substitution:
sed -i "" "s/googleProjectId =.*/googleProjectId = ${PROJECT_ID}/" k8s/collector/conf.yaml
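Note that `sed -i ""` is the BSD/macOS form of in-place editing; GNU sed on Linux expects `sed -i` with no argument after the flag. A portable alternative writes to a temporary file first, shown here on a throwaway file rather than the real `k8s/collector/conf.yaml`:

```shell
# Demo of a portable in-place substitution; PROJECT_ID and the file
# contents here are stand-ins for the real config.
PROJECT_ID=my-project
conf=$(mktemp)
echo 'googleProjectId = PLACEHOLDER' > "$conf"
tmp=$(mktemp)
sed "s/googleProjectId =.*/googleProjectId = ${PROJECT_ID}/" "$conf" > "$tmp"
mv "$tmp" "$conf"
cat "$conf"   # googleProjectId = my-project
```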
Then deploy the following Kubernetes manifests:
kubectl apply -f k8s/collector/conf.yaml
kubectl apply -f k8s/collector/deploy.yaml
kubectl apply -f k8s/collector/service.yaml
This will create a `snowplow-collector` deployment, which uses the official Snowplow image.
To check whether the deployment works, run
kubectl get pods -A | grep snowplow
You should see a few pods, all in the `Running` state. To verify that everything works smoothly, you can run the health check script:
bash scripts/collector_health_check.sh
If there were no errors, head to the Pub/Sub web console; after a few seconds you should see some events in the `good` topic.
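If the health check script is unavailable, a manual check is also possible: the Scala Stream Collector exposes a `GET /health` endpoint that answers `OK`. A sketch, assuming the Service is named `snowplow-collector` and listens on port 8080 (check `k8s/collector/service.yaml` for the actual name and port):

```shell
# Port-forward to the collector and hit its /health endpoint.
# The Service name and port are assumptions; adjust to your manifests.
collector_health() {
  kubectl port-forward service/snowplow-collector 18080:8080 >/dev/null 2>&1 &
  local pf=$!
  sleep 2
  curl -fsS http://localhost:18080/health
  kill "$pf" 2>/dev/null || true
}
```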
Check the Snowplow documentation.
The next step is to start a streaming job on Google Dataflow (Apache Beam). To do this you will use a one-time Kubernetes job.
Before that, the enrich configuration requires you to provide the GCP project ID. You can do this by running the following substitutions:
sed -i "" "s/googleProjectId =.*/googleProjectId = ${PROJECT_ID}/" k8s/enrich/conf.yaml
sed -i "" "s/\*PROJECT\*/${PROJECT_ID}/" k8s/enrich/job.yaml # does not work
Then you need a key that can write to GCS:
cp keys/snowplow-admin.json keys/credentials.json
kubectl create secret generic gcs-writer-sa --from-file keys/credentials.json
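For reference, `kubectl create secret generic ... --from-file` stores the file base64-encoded under a key named after the file (`credentials.json` here). The encoding round-trip can be seen with plain shell, using dummy data rather than a real key:

```shell
# Dummy stand-in for keys/credentials.json; never commit a real key.
key=$(mktemp)
echo '{"type":"service_account"}' > "$key"
# This is the value kubectl would place under data.credentials.json:
encoded=$(base64 < "$key" | tr -d '\n')
# Decoding recovers the original file content:
echo "$encoded" | base64 -d
```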
TODO: there should be a key with a limited scope - which scope? TODO: some more configuration changes are needed.
Once your configuration is ready, run:
kubectl apply -f k8s/enrich/conf.yaml
kubectl apply -f k8s/enrich/job.yaml
After a few seconds, run:
kubectl get jobs -A
and you should see that the `snowplow-enrich` job has completed.
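Instead of polling `kubectl get jobs`, you can also block until the job finishes with `kubectl wait`. A small helper, using the job name from the manifests above (a namespace flag may be needed depending on where the job was created):

```shell
# Wait up to 5 minutes for the enrich job to complete.
wait_for_enrich() {
  kubectl wait --for=condition=complete --timeout=300s job/snowplow-enrich
}
```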
Check the Snowplow documentation.
We welcome all contributions! Please submit an issue or a PR, no matter whether it's a bug or a typo.
This project uses pre-commit to ensure code quality. To install pre-commit, run:
pip install pre-commit
# or
brew install pre-commit
Then, from the project directory, run `pre-commit install`.