Replace snowplow_web with a base that is compatible with mobile events [WIP]
matus-tomlein committed Jul 5, 2023
1 parent fab1c34 commit dce41d8
Showing 46 changed files with 3,135 additions and 822 deletions.
101 changes: 73 additions & 28 deletions dbt_project.yml
@@ -24,46 +24,91 @@ clean-targets:
- "dbt_packages"

vars:
surrogate_key_treat_nulls_as_empty_strings: true # turn on legacy behavior
snowplow_media_player:
surrogate_key_treat_nulls_as_empty_strings: true # turn on legacy behavior

# Sources
# snowplow__atomic_schema: 'atomic' # Only set if not using 'atomic' schema for Snowplow events data
# snowplow__database: # Only set if not using target.database for Snowplow events data -- WILL BE IGNORED FOR DATABRICKS
snowplow__events: "{{ source('atomic', 'events') }}"
snowplow__media_event_names: ['media_player_event']
snowplow__number_checkout_steps: 4
snowplow__number_category_levels: 4
snowplow__categories_separator: '/'
snowplow__use_product_quantity: false

snowplow__percent_progress_boundaries: [10, 25, 50, 75]
snowplow__valid_play_sec: 30
snowplow__complete_play_rate: 0.99
snowplow__max_media_pv_window: 10
# please set any of the below three variables to true if the related context schemas are enabled for your warehouse, please note it cannot be used to filter the data:
# set to true if the YouTube context schema is enabled
snowplow__enable_youtube: false
# set to true if the HTML5 media element context schema is enabled
snowplow__enable_whatwg_media: false
# set to true if the HTML5 video element context schema is enabled
snowplow__enable_whatwg_video: false
snowplow__media_player_event_context: "com_snowplowanalytics_snowplow_media_player_event_1"
snowplow__media_player_context: "com_snowplowanalytics_snowplow_media_player_1"
snowplow__youtube_context: "com_youtube_youtube_1"
snowplow__html5_media_element_context: "org_whatwg_media_element_1"
snowplow__html5_video_element_context: "org_whatwg_video_element_1"
snowplow__context_web_page: 'com_snowplowanalytics_snowplow_web_page_1'

# Variables - Standard Config
snowplow__start_date: '2020-01-01'
snowplow__backfill_limit_days: 30
snowplow__app_id: []
snowplow__derived_tstamp_partitioned: true
# Variables - Advanced Config
snowplow__lookback_window_hours: 6
snowplow__session_lookback_days: 730
snowplow__days_late_allowed: 3
snowplow__max_session_days: 3
snowplow__upsert_lookback_days: 30
snowplow__query_tag: "snowplow_dbt"
snowplow__dev_target_name: 'dev'
snowplow__allow_refresh: false
snowplow__enable_load_tstamp: true
# Variables - Databricks Only
# Add the following variable to your dbt project's dbt_project.yml file
# Depending on the use case it should either be the catalog (for Unity Catalog users from databricks connector 1.1.1 onwards) or the same value as your snowplow__atomic_schema (unless changed it should be 'atomic')
# snowplow__databricks_catalog: 'hive_metastore'

snowplow__percent_progress_boundaries: [10, 25, 50, 75]
snowplow__valid_play_sec: 30
snowplow__complete_play_rate: 0.99
snowplow__max_media_pv_window: 10
# please set any of the below three variables to true if the related context schemas are enabled for your warehouse, please note it cannot be used to filter the data:
# set to true if the YouTube context schema is enabled
snowplow__enable_youtube: false
# set to true if the HTML5 media element context schema is enabled
snowplow__enable_whatwg_media: false
# set to true if the HTML5 video element context schema is enabled
snowplow__enable_whatwg_video: false
snowplow__media_player_event_context: "{{ source('atomic', 'com_snowplowanalytics_snowplow_media_player_event_1') }}"
snowplow__media_player_context: "{{ source('atomic', 'com_snowplowanalytics_snowplow_media_player_1') }}"
snowplow__youtube_context: "{{ source('atomic', 'com_youtube_youtube_1') }}"
snowplow__html5_media_element_context: "{{ source('atomic', 'org_whatwg_media_element_1') }}"
snowplow__html5_video_element_context: "{{ source('atomic', 'org_whatwg_video_element_1') }}"
# # Completely or partially remove models from the manifest during run start.
# on-run-start:
# - "{{ snowplow_media_player_delete_from_manifest(var('models_to_remove',[])) }}"

# # Update manifest table with last event consumed per successfully executed node/model
# on-run-end:
# - "{{ snowplow_utils.snowplow_incremental_post_hook('snowplow_media_player') }}"

models:
snowplow_media_player:
+bind: false
+materialized: view
base:
manifest:
+schema: "snowplow_manifest"
scratch:
+schema: "scratch"
+tags: "scratch"
bigquery:
+enabled: "{{ target.type == 'bigquery' | as_bool() }}"
databricks:
+enabled: "{{ target.type in ['databricks', 'spark'] | as_bool() }}"
default:
+enabled: "{{ target.type in ['redshift', 'postgres'] | as_bool() }}"
snowflake:
+enabled: "{{ target.type == 'snowflake' | as_bool() }}"
web:
+schema: "derived"
+tags: "snowplow_web_incremental"
+tags: "snowplow_media_player_incremental"
+enabled: true
scratch:
+schema: "scratch"
+tags: "scratch"
interactions_this_run:
bigquery:
enabled: "{{ target.type == 'bigquery' | as_bool() }}"
databricks:
enabled: "{{ target.type in ['databricks', 'spark'] | as_bool() }}"
redshift_postgres:
enabled: "{{ target.type in ['redshift', 'postgres'] | as_bool() }}"
snowflake:
enabled: "{{ target.type == 'snowflake' | as_bool() }}"
custom:
+schema: "scratch"
+tags: "snowplow_web_incremental"
+tags: "snowplow_media_player_incremental"
+enabled: false
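
As a rough illustration of how the package-scoped variables above might be overridden from a consuming project, the fragment below sets two of them under the `snowplow_media_player` key, mirroring the scoping this diff introduces. It is a sketch only: the indentation is assumed (the diff view above has lost it) and the chosen values are hypothetical.

```yaml
# dbt_project.yml of a consuming project (hypothetical values)
vars:
  snowplow_media_player:
    # Assumes the YouTube context schema is enabled in your warehouse
    snowplow__enable_youtube: true
    # Variable name taken from this diff; the date is an example value
    snowplow__start_date: '2023-01-01'
```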
6 changes: 6 additions & 0 deletions docs/markdown/snowplow_media_player_atomic_docs.md
@@ -21,3 +21,9 @@ This context table contains the entities related to the HTML5 Media Element, ada
{% docs table_html_video_element_context %}
This context table contains the entities related to the HTML5 Video Element, adapted from the whatwg spec.
{% enddocs %}

{% docs table_events %}

The `events` table contains all canonical events generated by [Snowplow's](https://snowplow.io/) trackers, including web, mobile and server-side events.

{% enddocs %}
43 changes: 43 additions & 0 deletions docs/markdown/snowplow_media_player_base_docs.md
@@ -0,0 +1,43 @@
{% docs table_base_sessions_lifecycle_manifest %}

This incremental table is a manifest of all sessions that have been processed by the Snowplow dbt ecommerce package. For each session, the start and end timestamps are recorded.

By knowing the lifecycle of a session, the model is able to determine which sessions, and thus which events, to process for a given timeframe, as well as the complete date range required to reprocess all events of each session.

{% enddocs %}

{% docs table_base_incremental_manifest %}

This incremental table is a manifest of the timestamp of the latest event consumed per model within the Snowplow dbt ecommerce package, as well as any models leveraging the incremental framework provided by the package. The latest event's timestamp is based on `collector_tstamp`. This table is used to determine what events should be processed in the next run of the model.
{% enddocs %}

{% docs table_base_new_event_limits %}

This table contains the lower and upper timestamp limits for the given run of the web model. These limits are used to select new events from the events table.

{% enddocs %}


{% docs table_base_events_this_run %}

For any given run, this table contains all required events to be consumed by subsequent nodes in the Snowplow dbt ecommerce package. This is a cleaned, deduped dataset, containing all columns from the raw events table as well as having the `page_view_id` joined in from the page view context, and all of the fields parsed from the various e-commerce contexts except the `product` context.

**Note: This table should be used as the input to any custom modules that require event level data, rather than selecting straight from `atomic.events`**

{% enddocs %}


{% docs table_base_sessions_this_run %}

For any given run, this table contains all the required sessions.

{% enddocs %}


{% docs table_base_quarantined_sessions %}

This table contains any sessions that have been quarantined. Sessions are quarantined once they exceed the maximum allowed session length, defined by `snowplow__max_session_days`.
Once quarantined, no further events from these sessions will be processed. Events up until the point of quarantine remain in your derived tables.
The reason for removing long sessions is to reduce table scans on both the events table and all derived tables. This improves performance greatly.

{% enddocs %}
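
The quarantine threshold described above is driven by `snowplow__max_session_days`, which this commit's `dbt_project.yml` sets to 3. A minimal sketch of raising it from a consuming project, assuming the same package-scoped `vars` layout and an arbitrary example value:

```yaml
# dbt_project.yml of a consuming project (hypothetical value)
vars:
  snowplow_media_player:
    # Sessions longer than this many days are quarantined
    snowplow__max_session_days: 5
```

A larger value keeps long sessions in scope for longer, which can mean wider scans of the events table and derived tables on each run.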