This project serves as a comprehensive guide to building an end-to-end data engineering pipeline. It covers each stage, from data ingestion to processing and storage, using a robust tech stack that ensures scalability and efficiency.
- Randomuser.me API: Provides random user data to simulate real-world data ingestion for the pipeline.
- Apache Airflow: Manages the workflow, orchestrating data ingestion and storing the fetched data in a PostgreSQL database (a sample ingestion DAG follows this list).
- Apache Kafka: Streams data from PostgreSQL to the data processing engine.
- Apache Zookeeper: Ensures distributed synchronization and coordination for Kafka.
- Apache Spark: Processes the streamed data, leveraging its distributed computing capabilities.
- Cassandra: Stores the processed data for efficient querying and analysis.
- PostgreSQL: Used for intermediate storage of ingested data before streaming.
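As a concrete starting point, the ingestion step can be a small Airflow DAG that pulls one user from the Randomuser.me API and writes it into PostgreSQL. This is only a minimal sketch: the `users` table, the `postgres_default` connection id, and the hourly schedule are assumptions to adapt to your own setup.

```python
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


def fetch_and_store_user():
    """Fetch one random user from randomuser.me and insert it into PostgreSQL."""
    response = requests.get("https://randomuser.me/api/", timeout=10)
    response.raise_for_status()
    user = response.json()["results"][0]

    # The connection id and target table are placeholders for this sketch.
    hook = PostgresHook(postgres_conn_id="postgres_default")
    hook.run(
        """
        INSERT INTO users (id, first_name, last_name, email)
        VALUES (%s, %s, %s, %s)
        ON CONFLICT (id) DO NOTHING;
        """,
        parameters=(
            user["login"]["uuid"],
            user["name"]["first"],
            user["name"]["last"],
            user["email"],
        ),
    )


with DAG(
    dag_id="user_ingestion",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",  # `schedule` assumes Airflow 2.4+; older releases use `schedule_interval`
    catchup=False,
) as dag:
    PythonOperator(
        task_id="fetch_and_store_user",
        python_callable=fetch_and_store_user,
    )
```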
- Data Ingestion: Automates fetching data from an external API using Apache Airflow.
- Streaming: Facilitates real-time data transfer with Apache Kafka (a producer sketch follows this list).
- Processing: Implements distributed data processing with Apache Spark.
- Storage: Combines PostgreSQL and Cassandra for comprehensive storage solutions.
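To move the ingested rows out of PostgreSQL, a lightweight producer script can publish them to a Kafka topic. The sketch below uses `psycopg2` and `kafka-python`; the DSN, the `users` table, and the `users_created` topic name are assumptions rather than fixed parts of this project.

```python
import json

import psycopg2
from kafka import KafkaProducer

# Connection details are assumptions; adjust them to your environment.
PG_DSN = "dbname=airflow user=airflow password=airflow host=localhost port=5432"
KAFKA_BOOTSTRAP = "localhost:9092"
TOPIC = "users_created"


def stream_users_to_kafka():
    """Read ingested users from PostgreSQL and publish them to a Kafka topic as JSON."""
    producer = KafkaProducer(
        bootstrap_servers=KAFKA_BOOTSTRAP,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    conn = psycopg2.connect(PG_DSN)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT id, first_name, last_name, email FROM users;")
            for user_id, first_name, last_name, email in cur:
                producer.send(TOPIC, {
                    "id": str(user_id),
                    "first_name": first_name,
                    "last_name": last_name,
                    "email": email,
                })
        producer.flush()
    finally:
        conn.close()


if __name__ == "__main__":
    stream_users_to_kafka()
```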
- Setting up a data pipeline with Apache Airflow.
- Real-time data streaming with Apache Kafka.
- Distributed synchronization with Apache Zookeeper.
- Data processing techniques with Apache Spark (see the streaming-job sketch after this list).
- Data storage solutions with Cassandra and PostgreSQL (a schema and verification sketch closes this section).
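For the processing stage, a Spark Structured Streaming job can consume the Kafka topic, parse the JSON payloads, and append them to Cassandra. This is a rough outline, not the project's definitive job: the connector package versions, the topic name, and the `spark_streams.created_users` table (created in the final sketch of this section) are all assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

# Expected shape of the JSON messages published in the Kafka step above.
USER_SCHEMA = StructType([
    StructField("id", StringType(), False),
    StructField("first_name", StringType(), True),
    StructField("last_name", StringType(), True),
    StructField("email", StringType(), True),
])

spark = (
    SparkSession.builder
    .appName("user-stream-processor")
    # Package coordinates are assumptions; they must match your Spark/Scala build.
    .config(
        "spark.jars.packages",
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1,"
        "com.datastax.spark:spark-cassandra-connector_2.12:3.5.1",
    )
    .config("spark.cassandra.connection.host", "localhost")
    .getOrCreate()
)

# Subscribe to the topic fed by the producer sketch above.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "users_created")
    .option("startingOffsets", "earliest")
    .load()
)

# Kafka values arrive as bytes; cast to string and unpack the JSON columns.
users = (
    raw.selectExpr("CAST(value AS STRING) AS json")
    .select(from_json(col("json"), USER_SCHEMA).alias("data"))
    .select("data.*")
)


def write_to_cassandra(batch_df, batch_id):
    """Append one micro-batch to the Cassandra table."""
    (
        batch_df.write
        .format("org.apache.spark.sql.cassandra")
        .mode("append")
        .options(keyspace="spark_streams", table="created_users")
        .save()
    )


query = (
    users.writeStream
    .foreachBatch(write_to_cassandra)
    .option("checkpointLocation", "/tmp/checkpoints/created_users")
    .start()
)
query.awaitTermination()
```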
- Apache Airflow
- Python
- Apache Kafka
- Apache Zookeeper
- Apache Spark
- Cassandra
- PostgreSQL
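Finally, to close the loop, the keyspace and table that the Spark job writes into can be created and then queried with the Python Cassandra driver. The host, keyspace, table, and replication settings below are placeholder choices for a single-node development cluster.

```python
from cassandra.cluster import Cluster

# Host, keyspace, and table names are assumptions; align them with the Spark sketch above.
cluster = Cluster(["localhost"])
session = cluster.connect()

# A simple keyspace suitable only for a single-node development setup.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS spark_streams
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
""")

session.execute("""
    CREATE TABLE IF NOT EXISTS spark_streams.created_users (
        id TEXT PRIMARY KEY,
        first_name TEXT,
        last_name TEXT,
        email TEXT
    );
""")

# Read back a few processed rows to verify the end-to-end flow.
rows = session.execute(
    "SELECT id, first_name, last_name, email FROM spark_streams.created_users LIMIT 10;"
)
for row in rows:
    print(row.id, row.first_name, row.last_name, row.email)

cluster.shutdown()
```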