[CT-96] Allow unique_key for incremental materializations to take a list #2479
Comments
Thanks for the solid proposal, @azhard. I've now heard this from several members of the community. Though it's what we recommend, I understand that creating a single hashed surrogate key for each model is against some folks' warehousing paradigms. Currently, that requires overriding a complex merge macro, or the entire incremental materialization.
Could you clarify what you mean here? Is the BQ implementation of `dbt_utils.surrogate_key` the problem?
Hey @jtcohen6, essentially in `merge.sql` (around L25) dbt generates the match predicate as `DBT_INTERNAL_SOURCE.{{ unique_key }} = DBT_INTERNAL_DEST.{{ unique_key }}`. While this would normally work, the `unique_key` value is substituted in verbatim, so a surrogate-key expression gets prefixed as though it were a column name and the rendered SQL is invalid. I've taken a look at implementing this change to use `unique_key` as a list in the merge macro.
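For reference, the relevant logic in `merge.sql` looks roughly like this (a sketch reconstructed from the predicate quoted later in this thread, not the exact macro text):

```sql
{# build the ON-clause match predicate from the configured unique_key #}
{% set unique_key_match %}
    DBT_INTERNAL_SOURCE.{{ unique_key }} = DBT_INTERNAL_DEST.{{ unique_key }}
{% endset %}
{% do predicates.append(unique_key_match) %}
{# if unique_key compiles to an expression such as to_hex(md5(...)),
   prefixing it with a table alias produces invalid SQL #}
```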
@jtcohen6 any thoughts on the above?
I see what you mean. When I think of creating a surrogate key for an incremental model, I'm thinking of creating a column within that model, to be stored in the resulting table and passed as the `unique_key`:

```sql
{{ config(
    materialized = 'incremental',
    unique_key = 'unique_id'
) }}

select
    date_day,
    user_id,
    {{ dbt_utils.surrogate_key('date_day', 'user_id') }} as unique_id
    ...
```

You're right that, as a result of the way the merge macros are implemented on BigQuery, you cannot create the surrogate key directly within the config like so:

```sql
{{ config(
    materialized = 'incremental',
    unique_key = dbt_utils.surrogate_key('date_day', 'user_id')
) }}
```

I've now heard this change requested from several folks, including (if I recall correctly) some Snowflake users who have found that merging on cluster keys improves performance somewhat. So I'm not opposed to passing an array of column names, though I'm worried about adding yet more configuration surface.

@drewbanin What do you think? Is that too much config-arg creep?
@jtcohen6 I'm into the idea of supporting an array of values for `unique_key`. Either way though, I do think we should mandate that `unique_key` be a column name, or a list of column names, rather than an arbitrary SQL expression. So, this would not be supported:
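A sketch of the disallowed form, reusing the surrogate-key expression from the comment above (the original inline example was not preserved):

```sql
-- not supported: unique_key set to a SQL-generating expression
{{ config(
    materialized = 'incremental',
    unique_key = dbt_utils.surrogate_key('date_day', 'user_id')
) }}
```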
but this could be:
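A sketch of the allowed form, with column names carried over from the earlier example:

```sql
-- supported: unique_key as a list of column names
{{ config(
    materialized = 'incremental',
    unique_key = ['date_day', 'user_id']
) }}
```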
I would like to think a little harder about how this would work for databases that do not support `merge` statements, where dbt instead deletes matching records and then inserts.
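For context, dbt's delete+insert strategy builds the single-key delete along these lines (a sketch; the exact macro text may differ):

```sql
delete from {{ target_relation }}
where ({{ unique_key }}) in (
    select ({{ unique_key }})
    from {{ tmp_relation }}
);
```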
It's not clear to me yet how we should build this delete statement with multiple keys, or if we should just not support multiple keys on pg/redshift/etc.
To the last point, Postgres, Redshift, and Snowflake all support `using` in delete statements:

```sql
delete
from {{ target_relation }}
using {{ tmp_relation }}
where
{% for column_name in unique_combination_of_columns %}
    {{ target_relation }}.{{ column_name }} = {{ tmp_relation }}.{{ column_name }}
    {{ "and" if not loop.last }}
{% endfor %}
```
Any advice on how I can use multiple key columns locally with BigQuery right now?
If need be, you could override the default merge macro, replacing its single-key predicate logic with something like:

```sql
{% if unique_key %}
    {% set unique_key_list = unique_key.split(",") %}
    {% for key in unique_key_list %}
        {% set this_key_match %}
            DBT_INTERNAL_SOURCE.{{ key }} = DBT_INTERNAL_DEST.{{ key }}
        {% endset %}
        {% do predicates.append(this_key_match) %}
    {% endfor %}
{% else %}
```

and then pass a comma-separated list to `unique_key`:

```sql
{{ config(
    materialized = 'incremental',
    unique_key = 'date_day, user_id'
) }}
```

That's obviously not the eventual implementation we're going for, but I think that would work for your specific use case.
Hey @jtcohen6 that works perfectly, thanks for the help!
Seeing the discussion and the proposed solution, could we rename this issue to "Allow unique_key to take a list"? To make it more generic. And I am ok with keeping the `unique_key` name and letting it accept either a string or a list.
I've had a read through both of the above threads and I'm happy to give this a go.

Considerations: I would like to introduce a Jinja test for lists, if everyone is in agreement, so we can do things like `{% if unique_key is list %}`. This will probably benefit a number of macros with list-or-string behaviour.
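For instance, a macro could normalize the config up front; a minimal sketch, assuming the proposed `is list` test (it is not a Jinja built-in):

```sql
{# accept either a single column name or a list of column names #}
{% if unique_key is list %}
    {% set key_columns = unique_key %}
{% else %}
    {% set key_columns = [unique_key] %}
{% endif %}
```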
@triedandtested-dev Thanks for your interest! I'd welcome a PR for this :) In terms of your three points: I'd keep the ModelConfig changes out of scope for this issue, and the Jinja `is list` test isn't strictly necessary, though I see the readability appeal.
Happy to keep the ModelConfig out of scope for this issue. I'll get cracking on a PR for this! The Jinja "is list" addition should help with readability going forward, but I completely agree it's not technically necessary.
@triedandtested-dev: How is it going with the PR?
+1, running into this issue
* Add unique_key to NodeConfig: `unique_key` can be a string or a list.
* merge.sql update to work with unique_key as list: extend the functionality to support both single and multiple keys. Signed-off-by: triedandtested-dev (Bryan Dunkley) <[email protected]>
* Updated test to include unique_key. Signed-off-by: triedandtested-dev (Bryan Dunkley) <[email protected]>
* updated tests. Signed-off-by: triedandtested-dev (Bryan Dunkley) <[email protected]>
* Fix unit and integration tests
* Update Changelog for 2479/4618

Co-authored-by: triedandtested-dev (Bryan Dunkley) <[email protected]>
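With this merged, a two-column `unique_key` yields a multi-predicate match clause along these lines (a sketch; relation names and the `revenue` column are invented for illustration):

```sql
-- generated for unique_key = ['date_day', 'user_id']
merge into analytics.my_model as DBT_INTERNAL_DEST
using analytics.my_model__dbt_tmp as DBT_INTERNAL_SOURCE
on (
    DBT_INTERNAL_SOURCE.date_day = DBT_INTERNAL_DEST.date_day
    and DBT_INTERNAL_SOURCE.user_id = DBT_INTERNAL_DEST.user_id
)
when matched then update set revenue = DBT_INTERNAL_SOURCE.revenue
when not matched then insert (date_day, user_id, revenue)
    values (DBT_INTERNAL_SOURCE.date_day, DBT_INTERNAL_SOURCE.user_id, DBT_INTERNAL_SOURCE.revenue)
```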
Not sure if out of scope, but the change isn't propagated to the rest of dbt: snapshots, for example, still only take a single unique_key. Thanks!
@DanDaDataEngineer The scope of this change was for models using the incremental materialization only. You're welcome to open a separate issue for supporting multiple unique keys in snapshots. For my part, I'd just say that the presence of a reliable unique identifier is even more important in snapshots, since the stakes are data loss, or duplication requiring manual intervention.
Describe the feature
Right now when creating a model in dbt with materialization set to `incremental`, you can pass a single column to `unique_key` to act as the key for merging. In an ideal world you would be able to pass in multiple columns, as there are many cases where a table's primary key is defined by more than one column.

The simplest solution would be to change `unique_key` to take a list (in addition to a string, for backwards compatibility) and build the predicates for the merge from the list rather than a single column. This might not be ideal, as the param name `unique_key` implies a single key. Alternatives would be adding a new optional parameter, `unique_key_list` or `unique_keys`, that always takes a list, and eventually deprecating the `unique_key` parameter.

Describe alternatives you've considered

Not necessarily other alternatives, but another thing to consider is the use of `unique_key` throughout the dbt project. It would stand to reason that whatever change is made here should apply to all other usages of `unique_key`. This could be done in one large roll-out or in stages, such as merges first, then upserts, then snapshots, etc.

Additional context
This feature should work across all databases.
Who will this benefit?
Hopefully most dbt users. Currently the only workaround for this is using `dbt_utils.surrogate_key`, which a) doesn't work for BigQuery and b) should ideally be an out-of-the-box dbt feature.