Skip to content

Refactoring my Strava pipeline to use dlt, dagster, duckdb, and dbt-core

Notifications You must be signed in to change notification settings

jairus-m/dagster-dlt

Repository files navigation

The Analytics Development Lifecycle within a Modern Data Engineering Framework

Utilizing dltHub, dbt, + dagster as a framework for developing data products with software engineering best practices.

Slide1

While the short-term goal is to learn these tools, the greater goal is to understand and flesh out what the full development and deployment cycle can look like for orchestrating a data platform and deploying custom pipelines. There is a great process using dbt where we have local development, testing, versioning/branching, CICD, code-review, separation of dev and prod, project structure/cohesion etc., but how can we apply that to the entire data platform and espeacially, the 10-20% of ingestion jobs that cannot be done in a managed tool like Airbyte and/or is best done using a custom solution?

Current Status

Screenshot 2024-12-13 at 11 00 14 PM
  • Built a dltHub EL pipeline via the RESTAPIConfig class in dagster_proj/assets/activities.py
    • Declaratively extracts my raw activity data from Strava's REST API and loads it into DuckDB
    • Created a custom configurable resource for Strava API - #5, #11
  • Built a dbt-core project to transform the activities data in analytics_dbt/models
  • Orchestrated ingest, transformation, and downstream dependecies (ML) with Dagster - #2, #6
    • Developed in dev environment and materaizlied in dagster dev server
    • Configured resources / credentials in a root .env file
    • Current Dagster folder structure (dependencies managed by UV)
      • One code location: dagster_proj/
        • Assets: dagster_proj/assets/
        • Resources: dagster_proj/resources/__init__.py
        • Jobs: dagster_proj/jobs/__init__.py
        • Schedules: dagster_proj/schedules/__init__.py
        • Definitions: dagster_proj/__init__.py
      • The structure is experimental and based on the DagsterU courses
  • Created an Sklearn ML pipeline to predict energy expenditure for a given cycling activity
    • WIP but the general flow of preprocessing, building the ML model, training, testing/evaluation, and prediction can be found in dagster_proj/assets/energy_prediction.py
    • This a downstream dependency of a dbt asset materialized in duckdb

Deployment Status

  • Officially Deployed this project to Dagster+ !!!
    • CICD w/ branching deployments for every PR
  • Seperated execution environments - #13
    • dev
    • branch
    • prod
  • Added ruff Python linter - #8
  • Astral uv for Python dependency management - #1

TODO:

  • Add unit tests
  • Add additional CI checks to run unit tests, Python linting, etc
  • Beef up the ML pipeline with dagster-mlflow for experiment tracking, model versioning, better model observability, etc
  • Add new Strava end points / dbt models / downstream analytics assets
  • Implement partitions/backfilling with dlt/Dagster

Getting Started:

For local development only:

  1. Clone this repo locally
  2. Create a .env file at the root of the directory:
# these are the config values for local dev and will change in branch/prod deployment
DBT_TARGET=dev
DAGSTER_ENVIRONMENT=dev
DUCKDB_DATABASE=data/dev/strava.duckdb

#strava
CLIENT_ID= 
CLIENT_SECRET=
REFRESH_TOKEN=
  1. Download uv and run uv sync
  2. Build the Python package in developer mode via uv pip install -e ".[dev]"
  3. Run the dagster daemon locally via dagster dev
  4. Materialize the pipeline!

Additional Notes:

  • The refresh_token in the Strava UI produces an access_token that is limited in scope. Please follow these Strava Dev Docs to generate the proper refresh_token which will then produce an access_token with the proper scopes.
  • If you want to run the dbt project locally, outside of dagster, you need to add a DBT_PROFILES_DIR environment variable to the .env file and export it
    • For example, my local env var is: DBT_PROFILES_DIR=/Users/jairusmartinez/Desktop/dlt-strava/analytics_dbt
    • Yours will be: DBT_PROFILES_DIR=/PATH_TO_YOUR_CLONED_REPO_DIR/analytics_dbt

About

Refactoring my Strava pipeline to use dlt, dagster, duckdb, and dbt-core

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages