ETL Pipeline on Movielens-25m Dataset

Overview

This project demonstrates an ETL pipeline built with modern data engineering tools and platforms. The same approach can be applied to any dataset.

Dataset

The dataset chosen for this project is Movielens-25m. It contains 25 million ratings and one million tag applications applied to 62,000 movies by 162,000 users.
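To make the shape of the data concrete, here is a minimal sketch of reading rating rows into typed records. It assumes the ratings file uses the header `userId,movieId,rating,timestamp`, with the timestamp in seconds since the Unix epoch; the `Rating` class, the `parse_ratings` helper, and the sample rows are illustrative and not part of this repository.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Rating:
    user_id: int
    movie_id: int
    rating: float
    rated_at: datetime


def parse_ratings(lines):
    """Yield Rating records from CSV lines that start with a header row."""
    for row in csv.DictReader(lines):
        yield Rating(
            user_id=int(row["userId"]),
            movie_id=int(row["movieId"]),
            rating=float(row["rating"]),
            # Assumption: timestamp is seconds since the Unix epoch (UTC).
            rated_at=datetime.fromtimestamp(int(row["timestamp"]), tz=timezone.utc),
        )


# Made-up rows in the assumed layout of ratings.csv (not real data).
SAMPLE = (
    "userId,movieId,rating,timestamp\n"
    "1,10,4.0,1000000000\n"
    "2,10,3.5,1000000060\n"
)

ratings = list(parse_ratings(io.StringIO(SAMPLE)))
```

For the full file, the same generator can be fed an open file handle instead of the in-memory sample, so rows are processed one at a time rather than loaded all at once.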

Architecture

Cloud resources are provisioned with Terraform.
An Airflow workflow ingests the data from the MovieLens website into the cloud.
The datasets are uploaded as raw data (Parquet format) to the data lake and loaded as tables into BigQuery for processing.
A simple SQL transformation is applied with dbt to cluster the ratings table on the movieId column. This reduces query time from 8-10 seconds to under 1 second.
All data sources are then connected to Data Studio for visualization.
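The ingestion step above can be sketched roughly as follows. This is a standard-library-only illustration of downloading the archive and pulling out its CSV files, not the repository's actual Airflow DAG. The download URL, the function names, and the demo archive are assumptions. Converting to Parquet and uploading to Cloud Storage and BigQuery are left out because they need third-party client libraries.

```python
import tempfile
import urllib.request
import zipfile
from pathlib import Path

# Assumption: public GroupLens download location for the archive.
DATASET_URL = "https://files.grouplens.org/datasets/movielens/ml-25m.zip"


def download_archive(url, dest_path):
    """Download the dataset archive to dest_path (not called in this sketch)."""
    urllib.request.urlretrieve(url, dest_path)
    return Path(dest_path)


def extract_csv_files(archive_path, dest_dir):
    """Extract every .csv member of the archive into dest_dir, flattening folders."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(archive_path) as archive:
        for member in archive.namelist():
            if not member.endswith(".csv"):
                continue  # skip READMEs and directory entries
            target = dest / Path(member).name
            target.write_bytes(archive.read(member))
            extracted.append(target)
    return sorted(extracted)


# Demo on a tiny hypothetical archive, so nothing is downloaded here.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("ml-25m/ratings.csv", "userId,movieId,rating,timestamp\n")
        zf.writestr("ml-25m/README.txt", "placeholder")
    demo_names = [p.name for p in extract_csv_files(demo_zip, Path(tmp) / "raw")]
```

In an Airflow workflow, each of these functions would typically be wrapped in its own task, so that a failed download can be retried without repeating the extraction.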

Tools and Platforms

This project makes extensive use of the Google Cloud Platform:

Compute Engine: VM for development
BigQuery: data warehouse
Cloud Storage: data lake and raw data storage
Data Studio: dashboard and visualization

Other tools used for the project:

Terraform: Infrastructure as code (IaC)
Airflow: Workflow orchestration
Docker: Containerization
dbt (data build tool): SQL transformations
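As an illustration of the dbt transformation mentioned under Architecture, the sketch below builds the kind of BigQuery DDL statement that a clustered table corresponds to. It is not the project's actual dbt model: the `clustered_table_ddl` helper and the dataset and table names are hypothetical. In a dbt model for BigQuery, the same effect is usually requested through the `cluster_by` model configuration rather than hand-written DDL.

```python
def clustered_table_ddl(source_table, target_table, cluster_column="movieId"):
    """Build a BigQuery DDL statement that copies a table, clustered on one column."""
    return (
        f"CREATE OR REPLACE TABLE `{target_table}`\n"
        f"CLUSTER BY {cluster_column}\n"
        f"AS SELECT * FROM `{source_table}`"
    )


# Hypothetical dataset and table names, for illustration only.
ddl = clustered_table_ddl("movielens.ratings", "movielens.ratings_clustered")
print(ddl)
```

Clustering stores rows with nearby movieId values together, so queries that filter or aggregate on movieId can skip blocks that cannot match and scan less data.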

Dashboard

You may access the dashboard with the visualizations in this link.

Acknowledgement

This project was done as a direct application of the material provided by the great folks at The Data Engineering Zoomcamp.
