This is a dbt
, DuckDB
and Python
project which uses the ATP Tennis Rankings, Results, and Stats dataset that is publicly available on Github, to demonstrate how to structure a dimensional model suitable for analytics. Another helpful resource is this Youtube video with regard to how to get the data into a DuckDB database.
For this project make sure you have following installed:
Clone the ATP Tour project to somewhere on your local directory
git clone https://github.com/achilala/dbt-atp-tour.git
cd dbt-atp-tour
Create a virtual environment for your python dependencies and activate it. Your python dependencies will be installed here.
python3 -m venv .venv/dbt-atp-tour
source .venv/dbt-atp-tour/bin/activate
Install the python dependencies
pip install -r requirements.txt
Some of the installed modules require reactivating your virtual environment to take effect
deactivate
source .venv/dbt-atp-tour/bin/activate
Running this python script will read the ATP tour data into a duckdb database called atp_tour.duckdb
python3 download_atp_tour_data.py
A new file called atp_tour.duckdb
should appear in the folder, and this is the DuckDB
database file. The reason why DuckDB
is my-go-to database is because it's very light, simple to setup and built for analytical processing.
Before running dbt
makes sure the profiles.yml
file is setup corrently. The path
in the file should point to your duckdb database and on mine it looks something like this atp_tour.duckdb
.
Test your connection and adjust your profiles.yml
settings accordingly until you get a successful test.
dbt debug
Run the dbt project to build your models for analysis
dbt clean && dbt deps && dbt build
To generate and view the project documentation run the following.
dbt docs generate && dbt docs serve
A browser should open with the docs site or click here. To cancel press the following on the keyboard
^c
Use the DuckDB CLI
to query the DuckDB
database. If you don't already have DuckDB v0.8.1 or higher installed then proceed to do the following:
Download and unzip the CLI
curl -OL https://github.com/duckdb/duckdb/releases/download/v0.9.1/duckdb_cli-osx-universal.zip
unzip duckdb_cli-osx-universal.zip
Open the database using the downloaded DuckDB CLI like this
./duckdb --readonly atp_tour.duckdb
And if you already have DuckDB install then open the database like this
duckdb --readonly atp_tour.duckdb
To sample the players data try the following
summarize mart.dim_player;
select *
from mart.dim_player
order by dim_player_key;
Now that the data is modeled, we can now run the streamlit app
streamlit run atp_tour_app.py
For BI and analysing the data we'll use Metabase
, run Docker to setup it up
docker-compose up --build
To tear down the Metabase
setup and start again
docker-compose down
Configure Metabase
as follows:
Add your data: DuckDB
Display name: ATP Tour
Database file: /home/db/atp_tour.duckdb
You're now ready to browse the data! Raw data is in the raw
schema and the final dimensional model is in mart
For consuming the data from other apps, run the following command to start the API service
uvicorn atp_tour_api:app --reload