For this project I decided to use Snowflake and dbt, given the amount of data that needs to be processed (856 million rows).
Python is used for statistical analysis and plotting, with the data retrieved via the Snowpark Python connector.
The overall architecture diagram can be seen below:
Snowflake's free trial comes with $400 worth of credits, which is more than enough for this project.
The first step is to parse the data from the ES.h5 file into multiple smaller Parquet files, as required for loading into Snowflake. This is done by the scripts/convert_h5_to_parquet_chunks.py script, which takes about 1 hour to run.
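A minimal sketch of that chunked conversion is shown below; it assumes the HDF5 dataset can be sliced as a structured array, and the chunk size, output directory, and file naming are illustrative rather than taken from the actual script.

```python
# Sketch of a chunked HDF5 -> Parquet conversion (illustrative; not the exact
# logic of scripts/convert_h5_to_parquet_chunks.py).
from pathlib import Path

import h5py
import pandas as pd

CHUNK_ROWS = 10_000_000  # ~86 chunks for 856M rows (assumed chunk size)
OUT_DIR = Path("parquet_chunks")
OUT_DIR.mkdir(exist_ok=True)

with h5py.File("ES.h5", "r") as f:
    ds = f["tick/trades_filter0vol"]  # only the filtered dataset is loaded
    n_rows = ds.shape[0]
    for i, start in enumerate(range(0, n_rows, CHUNK_ROWS)):
        stop = min(start + CHUNK_ROWS, n_rows)
        # Assumes a compound dtype that pandas can wrap directly.
        chunk = pd.DataFrame(ds[start:stop])
        chunk.to_parquet(OUT_DIR / f"es_tick_data_{i:04d}.parquet", index=False)
```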
The data is then uploaded to an S3 bucket by the scripts/upload_parquet_chunks_to_s3.py script.
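A hedged sketch of the upload step, using boto3 with a placeholder bucket name and key prefix:

```python
# Illustrative upload of the Parquet chunks to S3; the bucket name and key
# prefix are placeholders, not the project's actual configuration.
from pathlib import Path

import boto3

s3 = boto3.client("s3")
for path in sorted(Path("parquet_chunks").glob("*.parquet")):
    s3.upload_file(str(path), "my-tick-data-bucket", f"equity_index_future/{path.name}")
```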
In practice, these scripts could run in AWS Batch / ECS given the size of the data.
Note: Although two datasets were found in the ES.h5 file, only the tick/trades_filter0vol dataset was loaded into S3. The reasoning is that the tick/trades dataset contains zero-volume trades, which are not needed for this analysis; it is debatable, however, whether all of the data should still be loaded into the data warehouse for further analysis. Note 2: There may be a simpler way of parsing the data; the current script seems overly complex.
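For context, the size of that filtering decision can be sanity-checked by comparing the row counts of the two datasets; a small sketch (assuming tick/trades_filter0vol is simply tick/trades with the zero-volume rows removed):

```python
# Compare the two datasets found in ES.h5; the difference in row counts is the
# number of zero-volume trades dropped (assuming that is the only filter applied).
import h5py

with h5py.File("ES.h5", "r") as f:
    n_all = f["tick/trades"].shape[0]
    n_filtered = f["tick/trades_filter0vol"].shape[0]

print(f"tick/trades:            {n_all:,} rows")
print(f"tick/trades_filter0vol: {n_filtered:,} rows")
print(f"zero-volume trades dropped: {n_all - n_filtered:,}")
```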
The next step is to load the data into Snowflake, which is done through an S3 storage integration object and an external stage (see the infra DDL statements).
The external table, which serves as a reference to the S3 location, is created during the make setup process and can be found in the SOURCE schema.
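As a rough illustration of what that setup creates (the integration, stage, and external table names, the bucket, and the IAM role ARN below are placeholders; the authoritative statements are the infra DDL files), the objects look roughly like this when created through the Snowflake Python connector:

```python
# Hedged sketch of the storage integration / external stage / external table
# setup performed by "make setup"; all names, the ARN, and the bucket are placeholders.
import snowflake.connector

ddl_statements = [
    """
    CREATE STORAGE INTEGRATION IF NOT EXISTS s3_tick_data_integration
      TYPE = EXTERNAL_STAGE
      STORAGE_PROVIDER = 'S3'
      ENABLED = TRUE
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-read-role'
      STORAGE_ALLOWED_LOCATIONS = ('s3://my-tick-data-bucket/equity_index_future/')
    """,
    """
    CREATE STAGE IF NOT EXISTS source.tick_data_stage
      STORAGE_INTEGRATION = s3_tick_data_integration
      URL = 's3://my-tick-data-bucket/equity_index_future/'
      FILE_FORMAT = (TYPE = PARQUET)
    """,
    """
    CREATE EXTERNAL TABLE IF NOT EXISTS source.ext_equity_index_future_tick_data
      LOCATION = @source.tick_data_stage
      FILE_FORMAT = (TYPE = PARQUET)
      AUTO_REFRESH = FALSE
    """,
]

# Connection parameters are placeholders; in practice they would come from the environment.
with snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    role="SYSADMIN", warehouse="COMPUTE_WH", database="EQUITY_DB",
) as conn:
    with conn.cursor() as cur:
        for ddl in ddl_statements:
            cur.execute(ddl)
```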
The raw data is then loaded into the stg__mock_provider__equity_index_future__tick_data staging table as part of running the dbt pipeline.
This is the hierarchy of the models, where the data flows through the source -> staging -> transform -> fact schemas:
models/
└── equity_index_future
├── fact
│ ├── timeseries__es_equity_index_future__bar_returns_correlation.sql
│ ├── timeseries__es_equity_index_future__monthly_bar_variance.sql
│ └── timeseries__es_equity_index_future__weekly_bar_counts.sql
├── staging
│ ├── config.yml
│ ├── mock_provider
│ │ └── sources.yml
│ └── stg__mock_provider__equity_index_future__tick_data.sql
└── transform
├── int__es_equity_index_future__continuous.sql
├── int__es_equity_index_future__dollars_traded_bars.sql
├── int__es_equity_index_future__tick_bars.sql
└── int__es_equity_index_future__volume_bars.sql
The model lineage can be seen in the diagram below:
This diagram is a screenshot from the dbt docs site, which can be viewed by running make serve-docs.
The analysis is done in the analysis directory, where the data is retrieved via the Snowpark Python connector and plots are generated. These plots should answer the questions in the docs/analysis.md documentation.
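A minimal sketch of that retrieval-and-plot step via Snowpark (the connection parameters, the fact table chosen, and its column names are illustrative assumptions):

```python
# Illustrative retrieval of a fact table via Snowpark and a simple plot;
# connection parameters, table choice, and column names are placeholders.
import matplotlib.pyplot as plt
from snowflake.snowpark import Session

session = Session.builder.configs({
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "COMPUTE_WH", "database": "EQUITY_DB", "schema": "FACT",
}).create()

# Pull the weekly bar counts fact model into pandas for plotting.
df = session.table("timeseries__es_equity_index_future__weekly_bar_counts").to_pandas()

df.plot(x="WEEK_START", y="BAR_COUNT", kind="line", title="Weekly bar counts")
plt.tight_layout()
plt.show()

session.close()
```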