Update README.md
jairus-m authored Dec 21, 2024
1 parent d7eac54 commit 1a19dad
Showing 1 changed file with 19 additions and 7 deletions.
Utilizing dltHub, dbt, + dagster as a framework for developing data products wit

While the short-term goal is to learn these tools, the greater goal is to understand and flesh out what the full development and deployment cycle looks like for orchestrating a data platform and deploying custom pipelines. dbt already has a great process for this: local development, testing, versioning/branching, CI/CD, code review, separation of dev and prod, project structure/cohesion, etc. But how can we apply that to the entire data platform, and especially to the 10-20% of ingestion jobs that cannot be done in a managed tool like Airbyte and/or are best done with a custom solution?

# Current Status
<img width="1512" alt="Screenshot 2024-12-13 at 11 00 14 PM" src="https://github.com/user-attachments/assets/a29f1da9-2d6c-46f7-b3ed-3ed6679c88e0" />

- Built a dltHub EL pipeline via the RESTAPIConfig class in `dagster_proj/assets/activities.py`
- Declaratively extracts my raw activity data from Strava's REST API and loads it into DuckDB
- Built a dbt-core project to transform the staged activities data in `analytics_dbt/models`
- WIP but the general flow of preprocessing, building the ML model, training, testing/evaluation, and prediction can be found in `dagster_proj/assets/energy_prediction.py`
- This is a downstream dependency of a dbt asset materialized in DuckDB
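
The declarative extraction above boils down to a REST API config passed to dlt. A minimal sketch of its shape, assuming illustrative resource names, paths, and params (the repo's actual config lives in `dagster_proj/assets/activities.py`):

```python
# Illustrative shape of a dlt RESTAPIConfig-style dict for Strava.
# Resource names, params, and the auth wiring here are assumptions,
# not the repo's actual configuration.
strava_config = {
    "client": {
        "base_url": "https://www.strava.com/api/v3/",
        # an OAuth access_token would be injected here via dlt secrets
    },
    "resources": [
        {
            "name": "activities",
            "endpoint": {
                "path": "athlete/activities",
                "params": {"per_page": 200},
            },
        }
    ],
}

# A dlt pipeline would then load this source into DuckDB, roughly:
# pipeline = dlt.pipeline(destination="duckdb", dataset_name="strava")
# pipeline.run(rest_api_source(strava_config))
```

Because the config is just data, adding a new Strava endpoint is a matter of appending another entry to `resources` rather than writing new extraction code.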
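
The train/test/predict flow can be sketched with scikit-learn on synthetic data — the feature names, model choice, and target are stand-ins, not the repo's actual implementation in `energy_prediction.py`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dbt-materialized activities table
# (features loosely analogous to distance and moving time; target is made up)
rng = np.random.default_rng(0)
X = rng.uniform(1, 100, size=(200, 2))
y = 30 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 10, 200)

# Preprocess -> train -> evaluate -> predict
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_test, y_test)
prediction = model.predict(X_test[:1])
```

In the actual pipeline, the training data would come from the upstream dbt asset in DuckDB rather than a random array.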

## Deployment Status
- Officially deployed this project to Dagster+!
- CI/CD w/ branching deployments for every PR
- Separated execution environments ([implementation details](https://github.com/jairus-m/dagster-dlt/pull/9)) into:
- dev
- branch
- prod
- Added `ruff` Python linter
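
One common way to implement that kind of split is to branch on an environment variable when configuring resources — a minimal sketch, where the `DAGSTER_ENVIRONMENT` variable and per-environment DuckDB paths are assumptions (see the linked PR for the real implementation):

```python
import os

# Hypothetical env var and paths for illustration only;
# the repo's actual approach is described in the PR linked above.
ENVIRONMENT = os.getenv("DAGSTER_ENVIRONMENT", "dev")

DUCKDB_PATHS = {
    "dev": "data/dev.duckdb",
    "branch": "data/branch.duckdb",
    "prod": "data/prod.duckdb",
}

# Fall back to dev so a missing/unknown env var never touches prod data
duckdb_path = DUCKDB_PATHS.get(ENVIRONMENT, DUCKDB_PATHS["dev"])
```

Keeping the environment name as the only switch means dev, branch, and prod deployments can share one code path while writing to isolated databases.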

## TODO:
- Add unit tests
- Add additional CI checks to run unit tests, Python linting, etc.
- Beef up the ML pipeline with `dagster-mlflow` for experiment tracking, model versioning, better model observability, etc.
- Add new Strava endpoints / dbt models

# Getting Started:
1. Clone this repo locally
5. Run the dagster daemon locally via `dagster dev`
6. Materialize the pipeline!

__Additional Notes:__
- The `refresh_token` in the Strava UI produces an `access_token` that is limited in scope. Please follow these [Strava Dev Docs](https://developers.strava.com/docs/getting-started/#oauth) to generate the proper `refresh_token` which will then produce an `access_token` with the proper scopes.
- If you want to run the dbt project locally, outside of Dagster, you need to add a `DBT_PROFILES_DIR` environment variable to the `.env` file and export it
- For example, my local env var is: `DBT_PROFILES_DIR=/Users/jairusmartinez/Desktop/dlt-strava/analytics_dbt`
- Yours will be: `DBT_PROFILES_DIR=/PATH_TO_YOUR_CLONED_REPO_DIR/analytics_dbt`
