This repository contains the code for an end-to-end Apache Airflow data pipeline, running in Docker containers, that extracts data from both CSV files and a Postgres database into S3 buckets (using MinIO), processes the data with Apache Spark, and loads the results back into buckets. Python scripts then analyze the processed data to build a User Behaviour Metric, which is stored in DuckDB (acting as a data warehouse). Finally, the data is visualized using Quarto and Plotly.
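For orientation, below is a minimal sketch of how these stages could be wired together as an Airflow DAG. The DAG id, task names, and stub callables are hypothetical illustrations, not this repository's actual code:

```python
# Hypothetical sketch of the extract -> process -> load flow as an Airflow DAG.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_to_s3():
    # Pull the CSV and Postgres source data and land it in a MinIO bucket.
    pass


def run_spark_job():
    # Submit the Spark job that cleans the raw data and runs the classifier.
    pass


def load_metrics_to_duckdb():
    # Build the User Behaviour Metric and write it to the DuckDB warehouse.
    pass


with DAG(
    dag_id="user_behaviour_pipeline",  # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_to_s3", python_callable=extract_to_s3)
    process = PythonOperator(task_id="run_spark_job", python_callable=run_spark_job)
    load = PythonOperator(task_id="load_to_duckdb", python_callable=load_metrics_to_duckdb)

    extract >> process >> load  # run the three stages in order
```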
The architecture of the data pipeline is as follows:
- **Airflow** is used to orchestrate the data pipeline (DAGs).
- **Postgres** is used to store Airflow's metadata and the data to be processed.
- **DuckDB** acts as a data warehouse to store the processed data.
- **Quarto with Plotly** is used to convert code in `markdown` format to HTML files that can be embedded in the app or served as is.
- **Apache Spark** is used to process the data and run a classification algorithm.
- **MinIO** provides an S3-compatible open-source storage system.
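As one example of how these components connect, here is a minimal sketch of loading Spark's Parquet output from MinIO into a DuckDB warehouse table via DuckDB's `httpfs` extension. The warehouse file name, endpoint, credentials, bucket path, and table name are all hypothetical:

```python
# Hypothetical sketch: read processed Parquet files from MinIO into DuckDB.
import duckdb

con = duckdb.connect("warehouse.duckdb")  # hypothetical warehouse file

# Point DuckDB's httpfs extension at the local MinIO server.
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("SET s3_endpoint = 'localhost:9000';")    # MinIO's default port
con.execute("SET s3_access_key_id = 'minio';")        # hypothetical credentials
con.execute("SET s3_secret_access_key = 'minio123';")
con.execute("SET s3_use_ssl = false;")
con.execute("SET s3_url_style = 'path';")             # MinIO uses path-style URLs

# Materialize the Spark output as a warehouse table.
con.execute("""
    CREATE OR REPLACE TABLE user_behaviour_metric AS
    SELECT * FROM read_parquet('s3://processed/user_behaviour/*.parquet')
""")
```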
You can view the rendered dashboard HTML file here.