Replace snowplow_web with a base that can be compatible with mobile events (close #45)

PR #44
matus-tomlein committed Aug 18, 2023
1 parent a1b859a commit c854310
Showing 68 changed files with 3,011 additions and 1,662 deletions.
4 changes: 1 addition & 3 deletions README.md
@@ -4,7 +4,7 @@

# dbt-snowplow-media-player

A fully incremental model that transforms media player event data into derived tables for easier querying generated by the Snowplow [JavaScript tracker][javascript-tracker] in combination with media tracking specific plugins such as the [Media Tracking plugin][media-tracking] or the [YouTube Tracking plugin][youtube-tracking]. The package is built on top of the [dbt-snowplow-web package][dbt-snowplow-web] taking that as a basis to carry out the incremental update. It is therefore designed to be run together with the web model very similar to how a custom module would run.
A fully incremental model that transforms media player event data into derived tables for easier querying generated by the Snowplow [JavaScript tracker][javascript-tracker] in combination with media tracking specific plugins such as the [Media Tracking plugin][media-tracking] or the [YouTube Tracking plugin][youtube-tracking].

Please refer to the [doc site][snowplow-media-player-docs] for a full breakdown of the package.

@@ -95,7 +95,5 @@ limitations under the License.
[discourse-image]: https://img.shields.io/discourse/posts?server=https%3A%2F%2Fdiscourse.snowplow.io%2F
[discourse]: http://discourse.snowplow.io/

[dbt-snowplow-web]: https://hub.getdbt.com/dbt-labs/snowplow/latest/

[snowplow-media-player-docs-dbt]: https://snowplow.github.io/dbt-snowplow-media-player/#!/overview/snowplow_media_player
[snowplow-media-player-docs]: https://docs.snowplow.io/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-models/dbt-media-player-data-model/
137 changes: 92 additions & 45 deletions dbt_project.yml
@@ -2,68 +2,115 @@ name: 'snowplow_media_player'
version: '0.5.2'
config-version: 2

require-dbt-version: [">=1.4.0", "<2.0.0"]
require-dbt-version: ['>=1.4.0', '<2.0.0']

profile: 'default'

dispatch:
- macro_namespace: dbt
search_order: ['snowplow_utils', 'dbt']

model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
docs-paths: ["docs"]
snapshot-paths: ["snapshots"]
model-paths: ['models']
analysis-paths: ['analyses']
test-paths: ['tests']
seed-paths: ['seeds']
macro-paths: ['macros']
docs-paths: ['docs']
snapshot-paths: ['snapshots']

target-path: "target"
target-path: 'target'
clean-targets:
- "target"
- "dbt_packages"
- 'target'
- 'dbt_packages'

vars:
surrogate_key_treat_nulls_as_empty_strings: true # turn on legacy behavior
snowplow_media_player:

# Variables - Warehouse and tracker
snowplow__percent_progress_boundaries: [10, 25, 50, 75]
snowplow__events: '{{ source("atomic", "events") }}'
snowplow__dev_target_name: 'dev'
# snowplow__atomic_schema: 'atomic' # Only set if not using 'atomic' schema for Snowplow events data
# snowplow__database: # Only set if not using target.database for Snowplow events data -- WILL BE IGNORED FOR DATABRICKS

# Variables - Operation and logic
snowplow__complete_play_rate: 0.99
snowplow__max_media_pv_window: 10
snowplow__valid_play_sec: 30
surrogate_key_treat_nulls_as_empty_strings: true # turn on legacy behavior
snowplow__media_event_names: ['media_player_event']
snowplow__start_date: '2020-01-01'
snowplow__backfill_limit_days: 30
snowplow__lookback_window_hours: 6
snowplow__session_lookback_days: 730
snowplow__days_late_allowed: 3
snowplow__max_session_days: 3
snowplow__upsert_lookback_days: 30
snowplow__allow_refresh: false

# Variables - Contexts, filters, and logs
# please set any of the below three variables to true if the related context schemas are enabled for your warehouse, please note it cannot be used to filter the data:
# set to true if the YouTube context schema is enabled
snowplow__enable_youtube: false
# set to true if the HTML5 media element context schema is enabled
snowplow__enable_whatwg_media: false
# set to true if the HTML5 video element context schema is enabled
snowplow__enable_whatwg_video: false
snowplow__app_id: []

snowplow__percent_progress_boundaries: [10, 25, 50, 75]
snowplow__valid_play_sec: 30
snowplow__complete_play_rate: 0.99
snowplow__max_media_pv_window: 10
# please set any of the below three variables to true if the related context schemas are enabled for your warehouse, please note it cannot be used to filter the data:
# set to true if the YouTube context schema is enabled
snowplow__enable_youtube: false
# set to true if the HTML5 media element context schema is enabled
snowplow__enable_whatwg_media: false
# set to true if the HTML5 video element context schema is enabled
snowplow__enable_whatwg_video: false
snowplow__media_player_event_context: "{{ source('atomic', 'com_snowplowanalytics_snowplow_media_player_event_1') }}"
snowplow__media_player_context: "{{ source('atomic', 'com_snowplowanalytics_snowplow_media_player_1') }}"
snowplow__youtube_context: "{{ source('atomic', 'com_youtube_youtube_1') }}"
snowplow__html5_media_element_context: "{{ source('atomic', 'org_whatwg_media_element_1') }}"
snowplow__html5_video_element_context: "{{ source('atomic', 'org_whatwg_video_element_1') }}"
# Variables - Warehouse Specific
snowplow__media_player_event_context: 'com_snowplowanalytics_snowplow_media_player_event_1'
snowplow__media_player_context: 'com_snowplowanalytics_snowplow_media_player_1'
snowplow__youtube_context: 'com_youtube_youtube_1'
snowplow__html5_media_element_context: 'org_whatwg_media_element_1'
snowplow__html5_video_element_context: 'org_whatwg_video_element_1'
snowplow__context_web_page: 'com_snowplowanalytics_snowplow_web_page_1'
snowplow__derived_tstamp_partitioned: true
snowplow__query_tag: 'snowplow_dbt'
snowplow__enable_load_tstamp: true
# Databricks Only
# Depending on the use case it should either be the catalog (for Unity Catalog users from databricks connector 1.1.1 onwards) or the same value as your snowplow__atomic_schema (unless changed it should be 'atomic')
# snowplow__databricks_catalog: 'hive_metastore'

# Completely or partially remove models from the manifest during run start.
on-run-start:
- '{{ snowplow_media_player_delete_from_manifest(var("models_to_remove",[])) }}'

# Update manifest table with last event consumed per successfully executed node/model
on-run-end:
- '{{ snowplow_utils.snowplow_incremental_post_hook("snowplow_media_player") }}'

models:
snowplow_media_player:
+bind: false
+materialized: view
web:
+schema: "derived"
+tags: "snowplow_web_incremental"
+enabled: true
base:
manifest:
+schema: 'snowplow_manifest'
scratch:
+schema: 'scratch'
+tags: 'scratch'
bigquery:
+enabled: '{{ target.type == "bigquery" | as_bool() }}'
databricks:
+enabled: '{{ target.type in ["databricks", "spark"] | as_bool() }}'
default:
+enabled: '{{ target.type in ["redshift", "postgres"] | as_bool() }}'
snowflake:
+enabled: '{{ target.type == "snowflake" | as_bool() }}'
media_base:
+schema: 'derived'
+tags: 'snowplow_media_player_incremental'
scratch:
+schema: "scratch"
+tags: "scratch"
interactions_this_run:
bigquery:
enabled: "{{ target.type == 'bigquery' | as_bool() }}"
databricks:
enabled: "{{ target.type in ['databricks', 'spark'] | as_bool() }}"
redshift_postgres:
enabled: "{{ target.type in ['redshift', 'postgres'] | as_bool() }}"
snowflake:
enabled: "{{ target.type == 'snowflake' | as_bool() }}"
+schema: 'scratch'
+tags: 'scratch'
media_plays:
+schema: 'derived'
+tags: 'snowplow_media_player_incremental'
media_stats:
+schema: 'derived'
+tags: 'snowplow_media_player_incremental'
custom:
+schema: "scratch"
+tags: "snowplow_web_incremental"
+schema: 'scratch'
+tags: 'snowplow_media_player_incremental'
+enabled: false
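
For readers adapting this configuration, here is a minimal sketch of how a consuming project might override a few of the package variables in its own `dbt_project.yml`. The variable names are taken from the diff above; the values, and the choice of which variables to override, are illustrative assumptions rather than recommendations.

```yaml
# dbt_project.yml of the consuming project (illustrative values only)
vars:
  snowplow_media_player:
    # Assumed date of the first tracked media event; the package default is '2020-01-01'
    snowplow__start_date: '2023-01-01'
    # Set to true only if the YouTube context schema is enabled in your warehouse
    snowplow__enable_youtube: true
    # Package default shown; sessions longer than this are quarantined
    snowplow__max_session_days: 3
    # Package default shown
    snowplow__percent_progress_boundaries: [10, 25, 50, 75]
```

Variables left out of the override keep the defaults declared in the package's own `dbt_project.yml`.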
6 changes: 6 additions & 0 deletions docs/markdown/snowplow_media_player_atomic_docs.md
@@ -21,3 +21,9 @@ This context table contains the entities related to the HTML5 Media Element, ada
{% docs table_html_video_element_context %}
This context table contains the entities related to the HTML5 Video Element, adapted from the whatwg spec.
{% enddocs %}

{% docs table_events %}

The `events` table contains all canonical events generated by [Snowplow's](https://snowplow.io/) trackers, including web, mobile and server side events.

{% enddocs %}
43 changes: 43 additions & 0 deletions docs/markdown/snowplow_media_player_base_docs.md
@@ -0,0 +1,43 @@
{% docs table_base_sessions_lifecycle_manifest %}

This incremental table is a manifest of all sessions that have been processed by the Snowplow dbt media package. For each session, the start and end timestamps are recorded.

By knowing the lifecycle of a session, the model is able to determine which sessions, and thus which events, to process for a given timeframe, as well as the complete date range required to reprocess all events of each session.

{% enddocs %}

{% docs table_base_incremental_manifest %}

This incremental table is a manifest of the timestamp of the latest event consumed per model within the Snowplow dbt media package as well as any models leveraging the incremental framework provided by the package. The latest event's timestamp is based off `collector_tstamp`. This table is used to determine what events should be processed in the next run of the model.
{% enddocs %}

{% docs table_base_new_event_limits %}

This table contains the lower and upper timestamp limits for the given run of the media player model. These limits are used to select new events from the events table.

{% enddocs %}


{% docs table_base_events_this_run %}

For any given run, this table contains all required events to be consumed by subsequent nodes in the Snowplow dbt media package. This is a cleaned, deduped dataset, containing all columns from the raw events table as well as having the `page_view_id` joined in from the page view context, and all of the fields parsed from the various media contexts.

**Note: This table should be used as the input to any custom modules that require event level data, rather than selecting straight from `atomic.events`**

{% enddocs %}


{% docs table_base_sessions_this_run %}

For any given run, this table contains all the required sessions.

{% enddocs %}


{% docs table_base_quarantined_sessions %}

This table contains any sessions that have been quarantined. Sessions are quarantined once they exceed the maximum allowed session length, defined by `snowplow__max_session_days`.
Once quarantined, no further events from these sessions will be processed. Events up until the point of quarantine remain in your derived tables.
The reason for removing long sessions is to reduce table scans on both the events table and all derived tables. This improves performance greatly.

{% enddocs %}
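
The docs above say that custom modules should read from the events-this-run table, and that the incremental manifest also tracks models leveraging the package's incremental framework. A minimal sketch of how a consuming project might configure such custom models follows. It assumes that they opt into the framework through the same `snowplow_media_player_incremental` tag that the package applies to its own `custom` folder. The project name `my_project` and folder name `media_custom` are hypothetical.

```yaml
# dbt_project.yml of the consuming project
models:
  my_project:              # hypothetical project name
    media_custom:          # hypothetical folder holding custom media models
      +schema: 'scratch'
      # Assumption: this tag is what marks a model as part of the incremental run
      +tags: 'snowplow_media_player_incremental'
```

Check this convention against the package documentation before relying on it.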