The "ETL Process for Currency Quotes Data" project is a complete solution for extracting, transforming, and loading (ETL) currency quote data. It uses several techniques and architectural patterns to ensure the efficiency and robustness of the ETL process.
Highlights:
- MVC Architecture: Implementation of the Model-View-Controller (MVC) architecture, separating business logic, user interface, and data manipulation for better organization and easier code maintenance.
- Comprehensive Testing: Development of tests to ensure the quality and robustness of the code at the various stages of the ETL process.
- Parallelism in Models: Use of parallelism in the data transformation and loading stages, increasing efficiency and reducing processing time.
- Fire-and-Forget Messaging: Use of messaging (queue.Queue) in the fire-and-forget style to manage the files generated between the transformation and loading stages, ensuring a continuous and efficient data flow (see the sketch after this list).
- Parameter Validation: Sending valid parameters based on the request data source itself, ensuring the integrity and accuracy of the information processed.
- Configuration Management: Use of a configuration module to manage endpoints, retry intervals, and the number of attempts, providing flexibility and ease of adjustment.
- Common Module: Implementation of a common module for code reuse across the project, promoting consistency and reducing redundancy.
- Dynamic Views: Generation of views as index.html using nbconvert, based on consolidated data from a Jupyter Notebook that integrates the generated files into a single dataset for exploration and analysis.
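To picture the fire-and-forget handoff between the transformation and loading stages, here is a minimal, self-contained sketch. It is not the project's actual code: the function names, the payload shape, and the data/ output directory are assumptions, and pandas with pyarrow is assumed for Parquet output.

```python
import queue
import threading
import time
from pathlib import Path

import pandas as pd  # pandas + pyarrow assumed for Parquet output

# Handoff queue between the transform and load stages.
quote_queue: "queue.Queue[dict]" = queue.Queue()

def transform(payload: dict) -> None:
    """Split a multi-currency response into one message per currency
    and publish each one without waiting for a reply (fire-and-forget)."""
    for symbol, quote in payload.items():
        quote_queue.put({"symbol": symbol, **quote})

def load_worker() -> None:
    """In a separate thread, consume messages and write one Parquet file
    per quote, named <symbol>-<extraction unix timestamp>.parquet."""
    Path("data").mkdir(exist_ok=True)
    while True:
        message = quote_queue.get()
        if message is None:  # sentinel value: stop the worker
            break
        symbol = message.pop("symbol")
        pd.DataFrame([message]).to_parquet(
            f"data/{symbol}-{int(time.time())}.parquet"
        )
        quote_queue.task_done()

worker = threading.Thread(target=load_worker, daemon=True)
worker.start()

# A fake two-currency payload, shaped like what an extractor might return.
transform({
    "ETH-EUR": {"bid": 2800.5, "ask": 2801.1},
    "BTC-USD": {"bid": 64000.0, "ask": 64010.0},
})
quote_queue.join()     # wait for the loader to drain the queue
quote_queue.put(None)  # then stop the worker
```

Because the producer never blocks on a reply, the transform stage can keep publishing while the loader thread writes files in parallel, which is the efficiency gain the list above describes.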
- Extraction: A single request is made to a specific endpoint to obtain quotes for multiple currencies.
- Transformation: The request response is processed, separating each currency quote and storing it in an individual file in Parquet format, making the data easy to organize and retrieve.
- Load: The individual Parquet files are consolidated into a single dataset using a Jupyter Notebook, allowing comprehensive analysis and valuable insights into the currency quotes (see the sketch below).
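The consolidation in the Load step can be pictured with a short pandas sketch. The data/ directory and file pattern follow the repository layout described below; pyarrow is assumed as the Parquet engine.

```python
from pathlib import Path

import pandas as pd  # pyarrow assumed as the Parquet engine

# Read every per-currency Parquet file produced by the pipeline and
# concatenate them into a single dataset for exploration and analysis.
files = sorted(Path("data").glob("*.parquet"))
dataset = pd.concat(
    (pd.read_parquet(f).assign(source_file=f.name) for f in files),
    ignore_index=True,
)
print(dataset.head())
```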
In summary, this project offers a robust and efficient solution for collecting, processing and analyzing currency quote data, using advanced architecture and parallelism techniques to optimize each step of the ETL process.
Repository structure
- data/: Stores raw data in Parquet format.
  - ETH-EUR-1713658884.parquet: Example of raw data for ETH-EUR quotes. File name = symbol + extraction unix timestamp.
- notebooks/: Contains the data_explorer.ipynb notebook for data exploration.
- etl/: Contains the project's source code.
  - run.py: Entrypoint of the application.
  - common/: Library for code reuse and standardization.
  - controller/
    - pipeline.py: Receives data extraction requests and orchestrates the ETL models.
  - models/
    - extract/
      - api_data_extractor.py: Receives the parameters from the controller, sends the request, and returns the response as JSON (see the sketch after this structure).
    - transform/
      - publisher.py: Receives the JSON from the extractor, splits the dictionary by currency, and publishes each entry to a queue to be processed individually.
    - load/
      - parquet_loader.py: In a separate thread, receives each dictionary the transformer publishes to the queue and generates .parquet files in the default directory.
- views/: For storing data analysis and visualization.
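To illustrate how the controller, the configuration module, and api_data_extractor.py might fit together, here is a hedged sketch. The endpoint URL, parameter names, and retry values are placeholders invented for the example, not taken from the project.

```python
import time

import requests  # assumed HTTP client

# Hypothetical stand-in for the project's configuration module:
# endpoint, retry interval, and number of attempts in one place.
CONFIG = {
    "endpoint": "https://api.example.com/quotes",  # placeholder URL
    "retry_seconds": 2,
    "max_attempts": 3,
}

def extract(symbols: list[str]) -> dict:
    """Send a single request for multiple currency pairs and return
    the decoded JSON, retrying on failure per the configuration."""
    params = {"symbols": ",".join(symbols)}
    for attempt in range(1, CONFIG["max_attempts"] + 1):
        try:
            response = requests.get(
                CONFIG["endpoint"], params=params, timeout=10
            )
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            if attempt == CONFIG["max_attempts"]:
                raise
            time.sleep(CONFIG["retry_seconds"])
    return {}  # unreachable; keeps type checkers happy

# Example usage:
# payload = extract(["ETH-EUR", "BTC-USD"])
```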
Ensure Python 3.10 or higher is installed on your machine.
- Clone the repository:
$ git clone https://github.com/ivdatahub/data-consumer-api.git
- Go to the project directory:
$ cd data-consumer-api
- Install the dependencies and run the project:
$ poetry install && poetry run python etl/run.py
Learn more about Poetry.
You can see the complete data analysis in the Jupyter Notebook, which is deployed to GitHub Pages.
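For reference, the Dynamic Views step described above (rendering the notebook to index.html with nbconvert) could look like the following; the paths are assumptions based on the repository layout.

```python
import subprocess

# Render the exploration notebook to a static index.html under views/.
# The notebook and output paths are assumptions from the repo layout.
subprocess.run(
    [
        "jupyter", "nbconvert",
        "--to", "html",
        "--output-dir", "views",
        "--output", "index",  # nbconvert adds the .html extension
        "notebooks/data_explorer.ipynb",
    ],
    check=True,
)
```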