This project serves as a comprehensive guide to building an end-to-end data engineering pipeline. It covers each stage, from data ingestion to processing and storage, using a robust tech stack that ensures scalability and efficiency.
- Randomuser.me API: Provides random user data to simulate real-world data ingestion for the pipeline.
- Apache Airflow: Manages the workflow, orchestrating data ingestion and storing the fetched data in a PostgreSQL database (a sample ingestion DAG follows this list).
- Apache Kafka: Streams data from PostgreSQL to the data processing engine.
- Apache Zookeeper: Ensures distributed synchronization and coordination for Kafka.
- Apache Spark: Processes the streamed data, leveraging its distributed computing capabilities.
- Cassandra: Stores the processed data for efficient querying and analysis.
- PostgreSQL: Used for intermediate storage of ingested data before streaming.
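As a concrete starting point, the ingestion step can be a small Airflow DAG that pulls one user from the Randomuser.me API and writes it into PostgreSQL. This is only a minimal sketch: the `users` table, the `postgres_default` connection id, and the hourly schedule are assumptions to adapt to your own setup.

```python
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


def fetch_and_store_user():
    """Fetch one random user from randomuser.me and insert it into PostgreSQL."""
    response = requests.get("https://randomuser.me/api/", timeout=10)
    response.raise_for_status()
    user = response.json()["results"][0]

    # The connection id and target table are placeholders for this sketch.
    hook = PostgresHook(postgres_conn_id="postgres_default")
    hook.run(
        """
        INSERT INTO users (id, first_name, last_name, email)
        VALUES (%s, %s, %s, %s)
        ON CONFLICT (id) DO NOTHING;
        """,
        parameters=(
            user["login"]["uuid"],
            user["name"]["first"],
            user["name"]["last"],
            user["email"],
        ),
    )


with DAG(
    dag_id="user_ingestion",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",  # `schedule` assumes Airflow 2.4+; older releases use `schedule_interval`
    catchup=False,
) as dag:
    PythonOperator(
        task_id="fetch_and_store_user",
        python_callable=fetch_and_store_user,
    )
```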
- Data Ingestion: Automates fetching data from an external API using Apache Airflow.
- Streaming: Facilitates real-time data transfer with Apache Kafka (a producer sketch follows this list).
- Processing: Implements distributed data processing with Apache Spark.
- Storage: Combines PostgreSQL and Cassandra for comprehensive storage solutions.
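To move the ingested rows out of PostgreSQL, a lightweight producer script can publish them to a Kafka topic. The sketch below uses `psycopg2` and `kafka-python`; the DSN, the `users` table, and the `users_created` topic name are assumptions rather than fixed parts of this project.

```python
import json

import psycopg2
from kafka import KafkaProducer

# Connection details are assumptions; adjust them to your environment.
PG_DSN = "dbname=airflow user=airflow password=airflow host=localhost port=5432"
KAFKA_BOOTSTRAP = "localhost:9092"
TOPIC = "users_created"


def stream_users_to_kafka():
    """Read ingested users from PostgreSQL and publish them to a Kafka topic as JSON."""
    producer = KafkaProducer(
        bootstrap_servers=KAFKA_BOOTSTRAP,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    conn = psycopg2.connect(PG_DSN)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT id, first_name, last_name, email FROM users;")
            for user_id, first_name, last_name, email in cur:
                producer.send(TOPIC, {
                    "id": str(user_id),
                    "first_name": first_name,
                    "last_name": last_name,
                    "email": email,
                })
        producer.flush()
    finally:
        conn.close()


if __name__ == "__main__":
    stream_users_to_kafka()
```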
- Setting up a data pipeline with Apache Airflow.
- Real-time data streaming with Apache Kafka.
- Distributed synchronization with Apache Zookeeper.
- Data processing techniques with Apache Spark (see the streaming-job sketch after this list).
- Data storage solutions with Cassandra and PostgreSQL (a schema and verification sketch closes this section).
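For the processing stage, a Spark Structured Streaming job can consume the Kafka topic, parse the JSON payloads, and append them to Cassandra. This is a rough outline, not the project's definitive job: the connector package versions, the topic name, and the `spark_streams.created_users` table (created in the final sketch of this section) are all assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

# Expected shape of the JSON messages published in the Kafka step above.
USER_SCHEMA = StructType([
    StructField("id", StringType(), False),
    StructField("first_name", StringType(), True),
    StructField("last_name", StringType(), True),
    StructField("email", StringType(), True),
])

spark = (
    SparkSession.builder
    .appName("user-stream-processor")
    # Package coordinates are assumptions; they must match your Spark/Scala build.
    .config(
        "spark.jars.packages",
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1,"
        "com.datastax.spark:spark-cassandra-connector_2.12:3.5.1",
    )
    .config("spark.cassandra.connection.host", "localhost")
    .getOrCreate()
)

# Subscribe to the topic fed by the producer sketch above.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "users_created")
    .option("startingOffsets", "earliest")
    .load()
)

# Kafka values arrive as bytes; cast to string and unpack the JSON columns.
users = (
    raw.selectExpr("CAST(value AS STRING) AS json")
    .select(from_json(col("json"), USER_SCHEMA).alias("data"))
    .select("data.*")
)


def write_to_cassandra(batch_df, batch_id):
    """Append one micro-batch to the Cassandra table."""
    (
        batch_df.write
        .format("org.apache.spark.sql.cassandra")
        .mode("append")
        .options(keyspace="spark_streams", table="created_users")
        .save()
    )


query = (
    users.writeStream
    .foreachBatch(write_to_cassandra)
    .option("checkpointLocation", "/tmp/checkpoints/created_users")
    .start()
)
query.awaitTermination()
```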
- Apache Airflow
- Python
- Apache Kafka
- Apache Zookeeper
- Apache Spark
- Cassandra
- PostgreSQL
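Finally, to close the loop, the keyspace and table that the Spark job writes into can be created and then queried with the Python Cassandra driver. The host, keyspace, table, and replication settings below are placeholder choices for a single-node development cluster.

```python
from cassandra.cluster import Cluster

# Host, keyspace, and table names are assumptions; align them with the Spark sketch above.
cluster = Cluster(["localhost"])
session = cluster.connect()

# A simple keyspace suitable only for a single-node development setup.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS spark_streams
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
""")

session.execute("""
    CREATE TABLE IF NOT EXISTS spark_streams.created_users (
        id TEXT PRIMARY KEY,
        first_name TEXT,
        last_name TEXT,
        email TEXT
    );
""")

# Read back a few processed rows to verify the end-to-end flow.
rows = session.execute(
    "SELECT id, first_name, last_name, email FROM spark_streams.created_users LIMIT 10;"
)
for row in rows:
    print(row.id, row.first_name, row.last_name, row.email)

cluster.shutdown()
```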