This project demonstrates an ETL pipeline using modern data engineering tools and platforms which can be applied on any dataset
The chosen dataset for this project is the Movielens-25m. 25 million ratings and one million tag applications applied to 62,000 movies by 162,000 users.
- Cloud resources are initialized using terraform
- An Airflow workflow ingests the data from the movielens website into the cloud
- The datasets are uploaded to the cloud as raw data (parquet format) into the data lake and as tables in BigQuery for processing
- A simple SQL transformation is applied using dbt on the ratings table by clustering on the movieId column, this will allow reduction of query time from 8-10 seconds to under 1 second
- All data sources are then connected to Data Studio for visualization
This project makes extensive use of the Google Cloud Platform:
- Compute Engine: VM for development
- Bigquery: data warehouse
- Cloud Storage: data lake and raw data storage
- Data Studio: dashboard and visualization
Other tools used for the project
- Terraform: Infrastructure-as-code (IAC)
- Airflow: Workflow orchestraction
- Docker: Containerization
- Data Build Tool (DBT): SQL transformations
You may access the dashboard with the visualizations in this link.
This project was done as a direct application on the material provided by the great folks at The Data Engineering Zoomcamp.