Commit

Merge pull request #6 from jairus-m/refactor/dbt-models
Refactor/dbt models
jairus-m authored Dec 21, 2024
2 parents 179d79a + 003eabe commit 72ac665
Showing 11 changed files with 60 additions and 185 deletions.
42 changes: 1 addition & 41 deletions README.md
@@ -35,46 +35,6 @@ While the short-term goal is to learn these tools, the greater goal is to unders
 - Add unittests
 - Incorporate a Python linter (like ruff) to make sure code is standardized, neat, and follow PEP8

-## Goals:
-- Learn the dlt library to maximize features such as
-- Automated schema inference / evolution from raw json
-- Declaratively defining data pipelines
-- Type checking / in-flight data validation
-- Understanding resources, sources, and other concepts
-- Using Python generators, decorators
-- Applying SWE best practices to writing ELT code
-- Learn duckDB
-- As a force-mulitplier and cost-saver for local development
-- Portable, feature-rich, FAST, and free
-- Learn Dagster
-- Addresses main problems that we face as an org for EL / data platform unification:
-- Fragmented tooling and organization of ingestion jobs
-- No standard deployment or development process for ingestion jobs that cannot be done via Airbyte/Data shares/etc that need to be written in custom code
-- No version control, CICD, PR review, collaboration, testing, separate deployment environments, local development, etc
-- Custom code does not live in organized repositories
-- No orchestration of assets across the entire data pipeline
-- No end-to-end observability / centralized monitoring
-- No unified view of data platform
-- Difficult to assess the health of the platform and debug
-- Cannot optimize cost/compute
-- Fragility of ELT execution
-- Fragmented tooling / development / deployment =
-- Low throughput
-- Higher costs
-- Increased long-term technical debt
-- Shitty dev experience
-
-
-### How Dagster can addreses these problems:
-- Declarative and asset-based
-- Foundational philosophical/architectural difference with Airflow/Prefect that enables key capabilities that make data engineering teams far more productive
-- Python-first with full support of a mature SDLC
-- versioning, local development, dev/prod deployment, CICD, branching, code reviews, unified repository organization, etc
-- Fully hosted, serverless solution to execute custom ingestion code (hydrib deployments as well!)
-- Integrates well with dbt
-- All the benefits of having an orchestrator for end-to-end observability, logging, testing, and has a built-in data catalog
-
-
 # Getting Started:
 1. Clone this repo locally
 2. Create a `.env` file at the root of the directory:
@@ -93,4 +53,4 @@ While the short-term goal is to learn these tools, the greater goal is to unders
 5. Run the dagster daemon locally via `dagster dev`
 6. Materialize the pipeline!

-Note: The `refresh_token` in the Strava UI produces an `access_token` that is limited in scope. Please follow these [Strava Dev Docs](https://developers.strava.com/docs/getting-started/#oauth) to generate the proper `refresh_token` which will then produce an `access_token` with the proper scopes.
+__Note:__ The `refresh_token` in the Strava UI produces an `access_token` that is limited in scope. Please follow these [Strava Dev Docs](https://developers.strava.com/docs/getting-started/#oauth) to generate the proper `refresh_token` which will then produce an `access_token` with the proper scopes.
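
For reviewers unfamiliar with the token flow the note describes, here is a minimal sketch of the refresh-token exchange, assuming Strava's documented OAuth token endpoint. The function names and the way credentials are passed in are hypothetical and are not taken from this repository; the repo's own `.env` keys are not shown in this diff.

```python
import json
import urllib.parse
import urllib.request

# Strava's OAuth token endpoint, per the Strava developer docs linked above.
TOKEN_URL = "https://www.strava.com/oauth/token"


def build_refresh_request(client_id, client_secret, refresh_token):
    """Build (but do not send) the POST that trades a refresh token for an access token.

    All three arguments are placeholders supplied by the caller; this sketch
    does not read the project's .env file.
    """
    form = urllib.parse.urlencode(
        {
            "client_id": client_id,
            "client_secret": client_secret,
            "grant_type": "refresh_token",
            "refresh_token": refresh_token,
        }
    ).encode("utf-8")
    return urllib.request.Request(TOKEN_URL, data=form, method="POST")


def fetch_access_token(request):
    """Send the request and return the short-lived access token from the JSON reply."""
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return payload["access_token"]
```

The scopes attached to the returned `access_token` are fixed when the `refresh_token` is first authorized, which is why the note asks for a properly scoped `refresh_token` rather than the one shown in the Strava UI.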
14 changes: 5 additions & 9 deletions analytics_dbt/models/marts/_marts_schema.yml
@@ -2,14 +2,10 @@ version: 2

 models:
   - name: fct_activities
-    description: Contains all activity data with dimensions as foreign keys
-  - name: dim_activity_names
-    description: Activity names paired with their unique activity ID
+    description: Contains all activity data with ID FK
+  - name: dim_activities
+    description: Activity ID paired with activity dims
   - name: dim_dates
     description: Date dimension with time bins
-  - name: dim_has_heartrate
-    description: Simple boolean table (1 = True, 2 = False)
-  - name: dim_privacy
-    description: Simple boolean table ( 1 = Private/Only Me, 2 = Public/Everyone)
-  - name: dim_sport_type
-    description: Contains sport_type unique dimensions including the greather categories of (Cyling, Running, Other)
+  - name: sport_type_weekly_totals
+    description: Aggregated metrics by sport type per week
12 changes: 12 additions & 0 deletions analytics_dbt/models/marts/dim_activities.sql
@@ -0,0 +1,12 @@
+select
+id
+, name
+, has_heartrate
+, private
+, visibility
+, case
+when sport_type in ('VirtualRide', 'Ride', 'MountainBikeRide') then 'Cycling'
+when sport_type = 'Run' then 'Running'
+else 'Other'
+end as sport_type
+from {{ ref('obt_clean_activities') }}
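
As a review aid, the `case` expression in `dim_activities.sql` can be mirrored in a few lines of Python to check which raw values land in which bucket. This is only an illustrative sketch: the function and constant names are hypothetical, and the sample inputs in the usage note are not taken from the project's data.

```python
# Raw sport_type values that the model folds into 'Cycling'.
CYCLING_TYPES = {"VirtualRide", "Ride", "MountainBikeRide"}


def sport_category(sport_type):
    """Mirror the CASE expression: Cycling, Running, or Other.

    None falls through to 'Other', which matches SQL, where a NULL
    sport_type fails both WHEN branches and reaches the ELSE.
    """
    if sport_type in CYCLING_TYPES:
        return "Cycling"
    if sport_type == "Run":
        return "Running"
    return "Other"
```

For example, `sport_category("Ride")` gives `"Cycling"`, while any value not listed explicitly, such as a hypothetical `"TrailRun"`, gives `"Other"` because only the exact string `'Run'` maps to `'Running'`.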
8 changes: 0 additions & 8 deletions analytics_dbt/models/marts/dim_activity_names.sql

This file was deleted.

10 changes: 0 additions & 10 deletions analytics_dbt/models/marts/dim_has_heartrate.sql

This file was deleted.

11 changes: 0 additions & 11 deletions analytics_dbt/models/marts/dim_privacy.sql

This file was deleted.

19 changes: 0 additions & 19 deletions analytics_dbt/models/marts/dim_sport_type.sql

This file was deleted.

111 changes: 25 additions & 86 deletions analytics_dbt/models/marts/fct_activities.sql
@@ -1,87 +1,26 @@
-with activities as (
-select
-date,
-time,
-distance_miles,
-moving_time_minutes,
-elapsed_time_minutes,
-total_elevation_gain_feet,
-sport_type,
-id,
-achievement_count,
-kudos_count,
-comment_count,
-athlete_count,
-private,
-visibility,
-average_speed_mph,
-max_speed_mph,
-has_heartrate,
-pr_count,
-average_cadence,
-average_temp,
-average_watts,
-max_watts,
-weighted_average_watts,
-kilojoules,
-average_heartrate,
-max_heartrate,
-elev_high_feet,
-elev_low_feet
-from {{ ref('obt_clean_activities') }}
-
-),
-
-dates as (
-select * from {{ ref('dim_dates') }}
-),
-
-sport_type as (
-select * from {{ ref('dim_sport_type') }}
-),
-
-privacy as (
-select * from {{ ref('dim_privacy') }}
-),
-
-has_heartrate as (
-select * from {{ ref('dim_has_heartrate') }}
-)
-
 select
-a.id as id_pk,
-d.date_id,
-a.time,
-s.sport_type_pk as sport_type_fk,
-p.visibility_pk as visibility_fk,
-a.distance_miles,
-a.moving_time_minutes,
-a.elapsed_time_minutes,
-a.total_elevation_gain_feet,
-a.achievement_count,
-a.kudos_count,
-a.comment_count,
-a.athlete_count,
-a.average_speed_mph,
-a.max_speed_mph,
-a.pr_count,
-a.average_cadence,
-a.average_temp,
-a.average_watts,
-a.max_watts,
-a.weighted_average_watts,
-a.kilojoules,
-a.average_heartrate,
-a.max_heartrate,
-a.elev_high_feet,
-a.elev_low_feet,
-h.has_heartrate_pk as has_heartrate_fk,
-from activities as a
-left join dates as d
-on d.date = a.date and d.time = a.time
-left join sport_type as s
-on a.sport_type = s.sport_type
-left join privacy as p
-on a.private = p.private
-left join has_heartrate as h
-on a.has_heartrate = h.has_heartrate
+id
+, date
+, time
+, distance_miles
+, moving_time_minutes
+, elapsed_time_minutes
+, total_elevation_gain_feet
+, achievement_count
+, kudos_count
+, comment_count
+, athlete_count
+, average_speed_mph
+, max_speed_mph
+, pr_count
+, average_cadence
+, average_temp
+, average_watts
+, max_watts
+, weighted_average_watts
+, kilojoules
+, average_heartrate
+, max_heartrate
+, elev_high_feet
+, elev_low_feet
+from {{ ref('obt_clean_activities') }}
16 changes: 16 additions & 0 deletions analytics_dbt/models/marts/sport_type_weekly_totals.sql
@@ -0,0 +1,16 @@
+select
+date_trunc('week', strptime(fa.date, '%m-%d-%Y')) AS week_start
+, da.sport_type
+, sum(distance_miles) as distance_miles
+, sum(moving_time_minutes) as moving_time_minutes
+, sum(elapsed_time_minutes) as elapsed_time_minutes
+, sum(total_elevation_gain_feet) as total_elevation_gain_feet
+, sum(achievement_count) as achievement_count
+, sum(kudos_count) as kudos_count
+, sum(comment_count) as comment_count
+, sum(pr_count) as pr_count
+from {{ ref('fct_activities') }} as fa
+left join {{ ref('dim_activities') }} as da
+on fa.id = da.id
+group by week_start, sport_type
+order by week_start desc
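
To sanity-check the aggregation in `sport_type_weekly_totals.sql`, the same logic can be sketched in plain Python: parse the `%m-%d-%Y` date, truncate to the Monday that starts the week (which is what DuckDB's `date_trunc('week', ...)` returns), look up the sport type by `id`, and sum per group. The function names, the row shape (lists of dicts), and the default metric are assumptions for illustration, not part of the repository.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def week_start(date_text):
    """Parse an 'MM-DD-YYYY' string and return the Monday of that week."""
    day = datetime.strptime(date_text, "%m-%d-%Y").date()
    return day - timedelta(days=day.weekday())


def weekly_totals(facts, dims, metrics=("distance_miles",)):
    """Left-join facts to dims on id, then sum each metric per (week, sport_type).

    facts and dims are hypothetical lists of dicts standing in for
    fct_activities and dim_activities.
    """
    sport_by_id = {row["id"]: row["sport_type"] for row in dims}
    totals = defaultdict(lambda: defaultdict(float))
    for row in facts:
        # A fact with no matching dim row keeps a None sport_type (left join).
        key = (week_start(row["date"]), sport_by_id.get(row["id"]))
        for metric in metrics:
            # Skip missing values the way SQL sum() skips NULLs. Unlike SQL,
            # a group whose values are all missing totals 0.0 here, not NULL.
            totals[key][metric] += row.get(metric) or 0
    ordered = sorted(totals.items(), key=lambda item: item[0][0], reverse=True)
    return [(week, sport, dict(sums)) for (week, sport), sums in ordered]
```

Each returned tuple is `(week_start, sport_type, {metric: total})`, ordered newest week first to match `order by week_start desc`.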
2 changes: 1 addition & 1 deletion analytics_dbt/target/manifest.json

Large diffs are not rendered by default.

Binary file modified data/staging/strava.duckdb
Binary file not shown.