This project is designed to simulate a real-time data pipeline.
It generates real time fake data like user and song in Spotify, sends this data as a stream through a Kafka server, processes the data in real-time using PySpark(Structured Streaming), writes the processed data to a Cassandra database and visualize the results in Grafana.
The simulation includes data for the user (such as name, address, age, and nationality), the song being played (including artist, album name, track name, track duration, and genre), and the time of play.
The data processing includes selecting certain fields, transforming some of the data, and computing the average age of listeners for each artist.
In a real-world scenario, you would likely have a consistent stream of data coming from a source such as a web app, IoT devices, or logs. However, setting up such a source for the purpose of demonstrating or testing a data pipeline can be challenging, especially when you want a large volume of data. You might not have access to enough real data, or there could be privacy issues with using real user data.
This is why Faker comes in. Faker is a Python library that generates fake data. You can use it to generate data that mimics a variety of real-world data types. By using Faker, you can easily generate a large volume of realistic-looking data for your data pipeline.
-
Control over the volume of data
You can generate as much or as little data as you need, simply by running the Faker function more or fewer times.
-
Privacy
Since the data is all fake, there are no privacy concerns.
-
Variety
Faker can generate a wide range of data types, allowing you to simulate a wide range of real-world scenarios.
-
Consistency
The data generated by Faker is consistent in format, making it easier to process.
This project is set up to run in a dockerized environment, making it easy to get up and running.
Before starting, ensure you have the following installed:
- Python 3.10
- Pipenv
- Docker and Docker Compose
-
Clone the repository and run the following command
git clone https://github.com/OZOOOOOH/Fake-Streaming-Pipeline.git cd Fake-Streaming-Pipeline
-
Set up Python virtual environment
2-1. Install all necessary dependencies
pipenv install
2-2. Activate the new virtual environment
pipenv shell
-
Start Docker containers
docker-compose up -d
-
Run the application
python src/main.py
If you follow above descriptions, You can monitor the Kafka server and the Cassandra database to see the data in real-time. You can also query the Cassandra database to analyze the processed data.
To monitor the Kafka server
To watch the grafana dashboard
Class SpotifyStreamingProcessor in process.py reads the data from the Kafka stream, processes it, and writes it to Cassandra.
The processing involves selecting certain fields, transforming some of the data, and computing the average listener age for each artist. The processed data and the average age of listeners by artist are then written to Cassandra.
- Add Batch Processing Pipelines
- Enhance the structure of the application to handle larger volumes of data.
- Adapt the application to run on cloud services such as AWS.
- Implement a batch processing pipeline.
- Extend the application to handle data from multiple sources.
- Monitor cassandra and kafka containers using grafana