Sakila Company is a reference project that delivers an analytics platform, built entirely from open source software, for Sakila, a fictional DVD rental company.
$ ./scripts/init.sh
$ docker compose --env-file compose/.env up app-db analytics-dwh dagster-db metabase-db # press Ctrl+C once the databases have initialized
$ docker compose --env-file compose/.env up
The initialization script downloads required files and generates credential and environment files for every component.
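As a rough illustration of the credential-generation step, the sketch below writes an env file with one random password per service. It is not the contents of `scripts/init.sh`: the variable naming scheme (`<SERVICE>_PASSWORD`), the function names, and the choice to skip an existing file are all assumptions made for this example.

```python
import secrets
from pathlib import Path


def render_env(services, password_bytes=24):
    """Return env-file text with one random password per service.

    The <SERVICE>_PASSWORD naming scheme is hypothetical.
    """
    lines = []
    for service in services:
        key = service.upper().replace("-", "_") + "_PASSWORD"
        lines.append(f"{key}={secrets.token_urlsafe(password_bytes)}")
    return "\n".join(lines) + "\n"


def write_env(path, services):
    """Write the env file only if it is absent (an assumed design choice),
    so that re-running the script does not rotate existing credentials."""
    path = Path(path)
    if path.exists():
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(render_env(services))
    return True
```

Calling `write_env("compose/.env", ["app-db", "analytics-dwh", "dagster-db", "metabase-db"])` would produce a file that `docker compose --env-file compose/.env` can read, provided the compose file refers to the same variable names.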
The author worked as a data engineer, and the first data hire, at a SaaS company in Thailand from 2022 to 2024. It was a gold rush for big data management tools and platforms. However, in a developing country such as Thailand, the company's profit margin was too thin to afford the commercial platforms showcased in developed countries. He had to build and maintain data pipelines and data quality without the conveniences of modern tools, as if working in the pre-data-science era.
He no longer wants executives to blame data engineering for being "cost-centric". One effective way to reduce cost is to self-host as much software as possible, which is how this project came about. Any company struggling with the operating cost of its data platform can use it as a reference and adapt it to its budget and its team's skill level.
This project uses Pagila, a PostgreSQL port of the MySQL Sakila sample database, as its application data.
The project follows the ELT approach: load as much raw data as possible into analytics storage, such as a data warehouse or data lake, and transform it there. The application data is loaded into a ClickHouse data warehouse that serves analytics. Dagster sits at the center, orchestrating the loading, transformation and testing of data. Loading and transforming data across data stores is powered by the dlt library. dbt handles modeling, transforming and testing data inside the data warehouse.
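The ELT flow described above can be sketched with the standard library alone. This is a conceptual stand-in, not the project's pipeline: in-memory SQLite databases take the place of PostgreSQL and ClickHouse, plain functions take the place of dlt (load), dbt (transform and test) and Dagster (orchestration), and the model name `rentals_per_customer` is hypothetical. Only the `rental` table and its `customer_id` column come from Pagila.

```python
import sqlite3


def extract_load(source, warehouse, table):
    """EL step: copy a source table into the warehouse unchanged, as raw_<table>."""
    cur = source.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cur.description]
    rows = cur.fetchall()
    warehouse.execute(f"DROP TABLE IF EXISTS raw_{table}")
    warehouse.execute(f"CREATE TABLE raw_{table} ({', '.join(columns)})")
    placeholders = ", ".join("?" for _ in columns)
    warehouse.executemany(f"INSERT INTO raw_{table} VALUES ({placeholders})", rows)
    return len(rows)


def transform(warehouse):
    """T step: build a model inside the warehouse with SQL, after loading."""
    warehouse.execute("DROP TABLE IF EXISTS rentals_per_customer")
    warehouse.execute(
        "CREATE TABLE rentals_per_customer AS "
        "SELECT customer_id, COUNT(*) AS rental_count "
        "FROM raw_rental GROUP BY customer_id"
    )


def check_model(warehouse):
    """Test step: the model must have no NULL keys and no non-positive counts."""
    bad = warehouse.execute(
        "SELECT COUNT(*) FROM rentals_per_customer "
        "WHERE customer_id IS NULL OR rental_count <= 0"
    ).fetchone()[0]
    return bad == 0


def run_pipeline(source, warehouse):
    """Orchestration: run load, transform and test in order."""
    loaded = extract_load(source, warehouse, "rental")
    transform(warehouse)
    return loaded, check_model(warehouse)
```

The point of the ordering is that the raw copy lands in the warehouse first and all reshaping happens there, so a model can be rebuilt or corrected later without re-reading the application database.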
Once the data has been modeled into a star schema in the data warehouse, Metabase serves it as dashboards and analyses.
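To make the star schema concrete, here is a minimal sketch of one fact table joined to two dimension tables, with the kind of aggregate a dashboard would run. The table and column names (`fact_rental`, `dim_film`, `dim_date`) are hypothetical and are not taken from the project's dbt models. SQLite is used only so the example runs without a ClickHouse server.

```python
import sqlite3


def build_star(conn):
    """Create one fact table surrounded by two dimension tables."""
    conn.executescript(
        """
        CREATE TABLE dim_film (film_key INTEGER PRIMARY KEY, title TEXT, category TEXT);
        CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
        CREATE TABLE fact_rental (
            rental_id INTEGER PRIMARY KEY,
            film_key INTEGER REFERENCES dim_film (film_key),
            date_key INTEGER REFERENCES dim_date (date_key),
            amount REAL
        );
        """
    )


def revenue_by_category(conn):
    """Aggregate the fact table, sliced by attributes from both dimensions."""
    return conn.execute(
        """
        SELECT f.category, d.year, d.month, SUM(r.amount) AS revenue
        FROM fact_rental AS r
        JOIN dim_film AS f ON f.film_key = r.film_key
        JOIN dim_date AS d ON d.date_key = r.date_key
        GROUP BY f.category, d.year, d.month
        ORDER BY f.category, d.year, d.month
        """
    ).fetchall()
```

Every dashboard question then takes the same shape: aggregate the measures in the fact table and group by whichever dimension attributes are of interest.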
All of these components are deployed with a single, monolithic Docker Compose file.
The deployment will be migrated to Kubernetes to ease rollout and scaling on either on-premise servers or cloud providers.
DataHub will serve as the data catalog where teams can look up datasets.
The project will allow adding and modifying application data and will include a wider variety of data sources, e.g. web/app analytics trackers and open datasets, to mimic day-to-day operations and enrich analyses.