Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add unique media identifier (close #59) #61

Conversation

georgewoodhead
Copy link
Contributor

Add unique media identifier.

Description & motivation

Generate a more robust unique media element identifier using a surrogate key generated from media_id, media_label, media_type and media_player_type. This fixes the chance of duplicate media ids in the media stats table (close #59).

Checklist

  • I have verified that these changes work locally
  • I have updated the README.md (if applicable)
  • I have added tests & descriptions to my models (and macros if applicable)
  • I have raised a documentation PR if applicable (Link here if required)

@snowplowcla snowplowcla added the cla:yes [Auto generated] Snowplow Contributor License Agreement has been signed. label Nov 9, 2023
Copy link
Contributor

@rlh1994 rlh1994 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a changelog entry for this and we'll build it up as we go please, this release is likely to have a lot of changes/breaking changes so it would be good to keep track as we go

Copy link
Contributor

@rlh1994 rlh1994 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing that might be worth doing, although I appreciate it would be a pain, would be to add (or alter an existing) event to the test data with the same media_id but a different e.g. media_label. This should have caused a failure in the old version but would then pass correctly with these fixes - that ensures we've actually fixed the issue AND stops us making the same error in the future?

CHANGELOG Outdated Show resolved Hide resolved
Copy link
Contributor

@matus-tomlein matus-tomlein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, but it's quite confusing to have both the media_id and media_identifier in the derived tables. It's not obvious what is the difference between them, how do we explain it to users which one to use?

My suggestion would be to replace the media_id with the media_identifier (basically update and rename this macro).

If people are interested in the original player_id value from the YouTube or HTML5 context entity, maybe we can provide that as a separate field? I don't like the name player_id because it doesn't really identify the player, more the content. Perhaps content_id would be better? Or can we let users take the value directly from the context entities if they want?

@georgewoodhead
Copy link
Contributor Author

Yeah I get it is confusing. I think it is important we continue to expose the player ids from YouTube and HTML5 events in some way as these may be used as filters or join keys (I like your idea of naming it content_id!). As a side note, if we don't want to add this, it could become possible to use passthroughs to add in the original player id field with the base macros being applied in this release too.

Another option that came to mind, using the web package as an example, when we made the session identifier configurable we kept this called domain_sessionid and added a new field original_domain_sessionid. Whilst it’s not exactly the same scenario we could implement something similar - also this potentially causes less of a breaking change downstream when upgrading.

So to me we have the options:

  1. Remove the original media_id field completely, replacing it with the more robust identifier
  2. Remove the original media_id field completely, replacing it with the more robust identifier, but keep player ids from YouTube and HTML5 events as a field called content_id
  3. Overwrite the media_id field with the more robust identifier, and keep the original logic for media_id in a new field called original_media_id

WDYT?

@rlh1994
Copy link
Contributor

rlh1994 commented Nov 9, 2023

I think option 2, but maybe we call it player_id as that is more closely aligned to the original variable name in most cases from the context?

@matus-tomlein
Copy link
Contributor

Agreed, I like option 2! player_id would be good because that's also how we refer to it in the context entities, but it's a bit misleading – two player instances can have the same player_id if they play the same video. But maybe if we document it well what it actually means, it would be fine.

@georgewoodhead
Copy link
Contributor Author

Thanks both, I've made the changes for option 2! I've also changed the int tests to cover the cases of content sharing the same media label across different media types or media_player_ types. I fully expect the pr tests to not pass first time as only tested locally with snowflake 🤞

This change adds player_id to the base and stats table, but not the ad tables. If I'm right in understanding ads are only enabled with the new media plugin not the youtube or html5 players, so we wouldn't have a player id.

@georgewoodhead
Copy link
Contributor Author

Need to add a unique primary key to the snowplow_media_player_media_ad_views model for incremental runs to work correctly - at the moment duplicates are being added. With the row count test not being enabled in the int tests means this hasn't been flagged until now.

Copy link
Contributor

@matus-tomlein matus-tomlein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Great to see both issues fixed!

docs/markdown/snowplow_media_player_common_cols.md Outdated Show resolved Hide resolved
docs/markdown/snowplow_media_player_common_cols.md Outdated Show resolved Hide resolved
models/base/scratch/base_scratch.yml Outdated Show resolved Hide resolved
@georgewoodhead georgewoodhead merged commit 87ddb6b into release/snowplow-media-player/0.7.0 Nov 14, 2023
6 of 8 checks passed
@georgewoodhead georgewoodhead deleted the feature/add-media-identifier branch November 14, 2023 12:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla:yes [Auto generated] Snowplow Contributor License Agreement has been signed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants