Using dltHub, dbt, and Dagster as a framework for developing data products with software engineering best practices.
While the short-term goal is to learn these tools, the larger goal is to understand and flesh out what the full development and deployment cycle can look like for orchestrating a data platform and deploying custom pipelines. dbt already gives us a great process: local development, testing, versioning/branching, CI/CD, code review, separation of dev and prod, project structure/cohesion, etc. But how can we apply that to the entire data platform, and especially to the 10-20% of ingestion jobs that cannot be done in a managed tool like Airbyte and/or are best done with a custom solution?
- Built a dltHub EL pipeline via the `RESTAPIConfig` class in `dagster_proj/assets/activities.py` (a configuration sketch is included at the end of this list)
- Built a dbt-core project to transform the activities data in `analytics_dbt/models`
- Orchestrated ingest, transformation, and downstream dependencies (ML) with Dagster - #2, #6
- Developed in a dev environment and materialized via the `dagster dev` server
- Configured resources / credentials in a root `.env` file
- Current Dagster folder structure (dependencies managed by `uv`):
  - One code location: `dagster_proj/`
  - Assets: `dagster_proj/assets/`
  - Resources: `dagster_proj/resources/__init__.py`
  - Jobs: `dagster_proj/jobs/__init__.py`
  - Schedules: `dagster_proj/schedules/__init__.py`
  - Definitions: `dagster_proj/__init__.py`
  - The structure is experimental and based on the DagsterU courses
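For reference, the `Definitions` object in `dagster_proj/__init__.py` is what ties this folder structure together. The following is only a sketch of that wiring, under stated assumptions: the asset module names come from the paths above, but the resource dict, job, and schedule names are illustrative, not copied from the repo.

```python
# dagster_proj/__init__.py -- a sketch of the wiring, not the repo's exact contents.
from dagster import Definitions, load_assets_from_modules

from dagster_proj.assets import activities, energy_prediction  # asset modules listed above
from dagster_proj.resources import resources                   # hypothetical: dict of configured resources
from dagster_proj.jobs import strava_update_job                # hypothetical job name
from dagster_proj.schedules import strava_update_schedule      # hypothetical schedule name

# dbt models are exposed as assets as well (via dagster-dbt) and would be
# appended to this list in the same way.
all_assets = load_assets_from_modules([activities, energy_prediction])

defs = Definitions(
    assets=all_assets,
    resources=resources,
    jobs=[strava_update_job],
    schedules=[strava_update_schedule],
)
```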
- Created a scikit-learn ML pipeline to predict energy expenditure for a given cycling activity
  - WIP, but the general flow of preprocessing, building the ML model, training, testing/evaluation, and prediction can be found in `dagster_proj/assets/energy_prediction.py`
  - This is a downstream dependency of a dbt asset materialized in DuckDB
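A rough sketch of that pattern: a Dagster asset that depends on the dbt-built table in DuckDB, reads it, and fits a scikit-learn pipeline. The asset key, table, and column names below are placeholders for illustration, not the repo's actual schema (see `dagster_proj/assets/energy_prediction.py` for the real flow).

```python
# Sketch only: asset key, table, and feature names are illustrative.
import os

import duckdb
from dagster import AssetKey, asset
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


@asset(deps=[AssetKey("activities")])  # downstream of a dbt model materialized in DuckDB
def energy_prediction_model():
    con = duckdb.connect(os.getenv("DUCKDB_DATABASE", "data/dev/strava.duckdb"))
    df = con.execute(
        "SELECT distance, moving_time, total_elevation_gain, kilojoules FROM activities"
    ).df()

    X = df[["distance", "moving_time", "total_elevation_gain"]]
    y = df["kilojoules"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = make_pipeline(StandardScaler(), LinearRegression())
    model.fit(X_train, y_train)
    print(f"R^2 on held-out data: {model.score(X_test, y_test):.3f}")
    return model
```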
- Officially deployed this project to Dagster+!!!
  - CI/CD with branch deployments for every PR
  - Separated execution environments - #13
    - dev
    - branch
    - prod
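The dev/branch/prod split is driven by environment variables (`DAGSTER_ENVIRONMENT`, `DBT_TARGET`, `DUCKDB_DATABASE` from the `.env` file). A minimal sketch of how resources can be switched per environment; the resource keys and the use of `dagster-duckdb` here are illustrative assumptions, and the repo may wire this differently.

```python
# Sketch: pick resource configuration based on the deployment environment.
import os

from dagster_dbt import DbtCliResource
from dagster_duckdb import DuckDBResource

environment = os.getenv("DAGSTER_ENVIRONMENT", "dev")  # dev | branch | prod

RESOURCES = {
    "dev": {
        "duckdb": DuckDBResource(database=os.getenv("DUCKDB_DATABASE", "data/dev/strava.duckdb")),
        "dbt": DbtCliResource(project_dir="analytics_dbt", target="dev"),
    },
    "prod": {
        "duckdb": DuckDBResource(database=os.getenv("DUCKDB_DATABASE", "data/prod/strava.duckdb")),
        "dbt": DbtCliResource(project_dir="analytics_dbt", target="prod"),
    },
}

resources = RESOURCES.get(environment, RESOURCES["dev"])
```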
- Added the `ruff` Python linter - #8
- Added Astral's `uv` for Python dependency management - #1
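Regarding the first bullet above (the dlt EL pipeline): dlt's REST API source takes a declarative `RESTAPIConfig` describing the client and its resources. Below is a stripped-down sketch of what a Strava activities source can look like; the endpoint path and parameters are illustrative, and the real configuration lives in `dagster_proj/assets/activities.py`.

```python
# Sketch of a dlt REST API source for Strava; not the repo's exact config.
import dlt
from dlt.sources.rest_api import RESTAPIConfig, rest_api_source


def strava_activities_source(access_token: str):
    config: RESTAPIConfig = {
        "client": {
            "base_url": "https://www.strava.com/api/v3/",
            "auth": {"type": "bearer", "token": access_token},
        },
        "resources": [
            {
                "name": "activities",
                "endpoint": {
                    "path": "athlete/activities",
                    "params": {"per_page": 200},
                },
            }
        ],
    }
    return rest_api_source(config)


if __name__ == "__main__":
    pipeline = dlt.pipeline(pipeline_name="strava", destination="duckdb", dataset_name="strava_raw")
    pipeline.run(strava_activities_source(access_token="..."))
```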
- Add unit tests (a possible starting point is sketched after this list)
- Add additional CI checks to run unit tests, Python linting, etc.
- Beef up the ML pipeline with `dagster-mlflow` for experiment tracking, model versioning, better model observability, etc.
- Add new Strava endpoints / dbt models / downstream analytics assets
- Implement partitions/backfilling with dlt/Dagster
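On the unit-test item above: Dagster assets can be exercised directly with `materialize` in a pytest suite. A possible starting point, assuming the asset has no required resources; the import path and asset name are illustrative.

```python
# Sketch of a pytest-style unit test for a Dagster asset; names are illustrative.
from dagster import materialize

from dagster_proj.assets.energy_prediction import energy_prediction_model  # hypothetical import


def test_energy_prediction_model_materializes():
    result = materialize([energy_prediction_model])
    assert result.success
```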
For local development only:
- Clone this repo locally
- Create a `.env` file at the root of the directory:
  ```
  # these are the config values for local dev and will change in branch/prod deployment
  DBT_TARGET=dev
  DAGSTER_ENVIRONMENT=dev
  DUCKDB_DATABASE=data/dev/strava.duckdb

  # strava
  CLIENT_ID=
  CLIENT_SECRET=
  REFRESH_TOKEN=
  ```
- Download `uv` and run `uv sync`
- Build the Python package in developer mode via `uv pip install -e ".[dev]"`
- Run the Dagster daemon locally via `dagster dev`
- Materialize the pipeline!
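Once the `.env` file is in place, the Strava credentials can be surfaced to Dagster through a configurable resource and `EnvVar`. A minimal sketch of that pattern; the resource name is hypothetical and the repo's actual resource in `dagster_proj/resources/__init__.py` may differ.

```python
# Sketch: reading the .env credentials into a Dagster resource.
from dagster import ConfigurableResource, EnvVar


class StravaCredentials(ConfigurableResource):  # hypothetical resource name
    client_id: str
    client_secret: str
    refresh_token: str


strava_credentials = StravaCredentials(
    client_id=EnvVar("CLIENT_ID"),
    client_secret=EnvVar("CLIENT_SECRET"),
    refresh_token=EnvVar("REFRESH_TOKEN"),
)
```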
Additional Notes:
- The `refresh_token` in the Strava UI produces an `access_token` that is limited in scope. Please follow these Strava Dev Docs to generate the proper `refresh_token`, which will then produce an `access_token` with the proper scopes.
- If you want to run the dbt project locally, outside of Dagster, you need to add a `DBT_PROFILES_DIR` environment variable to the `.env` file and export it
  - For example, my local env var is: `DBT_PROFILES_DIR=/Users/jairusmartinez/Desktop/dlt-strava/analytics_dbt`
  - Yours will be: `DBT_PROFILES_DIR=/PATH_TO_YOUR_CLONED_REPO_DIR/analytics_dbt`
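For the scope note above: a properly scoped `refresh_token` comes from Strava's OAuth authorization flow with the `activity:read_all` scope, after which it can be exchanged for scoped access tokens. A rough sketch against Strava's public OAuth endpoints (this is illustrative helper code, not part of this repo):

```python
# Sketch of Strava's OAuth flow for a properly scoped refresh/access token.
import requests

CLIENT_ID = "your_client_id"
CLIENT_SECRET = "your_client_secret"

# 1) Authorize in a browser with the activity:read_all scope and copy the
#    `code` query parameter from the redirect URL.
auth_url = (
    "https://www.strava.com/oauth/authorize"
    f"?client_id={CLIENT_ID}&response_type=code"
    "&redirect_uri=http://localhost/exchange_token"
    "&approval_prompt=force&scope=activity:read_all"
)
print("Open this URL and authorize:", auth_url)

# 2) Exchange the one-time code for a scoped refresh_token + access_token.
code = input("Paste the `code` parameter here: ")
tokens = requests.post(
    "https://www.strava.com/oauth/token",
    data={
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "code": code,
        "grant_type": "authorization_code",
    },
).json()
print("refresh_token for your .env:", tokens["refresh_token"])
```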