feat: add full_refresh option to on_schema_change of incremental models #6412
Conversation
@pol-defont-reaulx Thanks for the contribution! I know many folks have wanted this for a while, so props to you for rolling up your sleeves and starting to make it happen.
After some local testing, I've realized there's a more foundational issue at play here. Given a model like this one:
{{ config(
    materialized = 'incremental',
    on_schema_change = 'full_refresh'
) }}

with source_data as (

    select * from {{ ref('some_model') }}

    {% if is_incremental() %}
    -- always last 3 days (PostgreSQL)
    where date_day > current_date - interval '3 days'
    {% endif %}

)

select ...
If the `on_schema_change` behavior is detected, and we need to move into "full refresh mode" — we actually need to re-compile the model's SQL, with `is_incremental()` now returning `false`. Why? Otherwise, we'll be replacing the existing table with only the past 3 days of data, when what we really want to do is start the whole process over again, as if we had configured the model with `full_refresh: true` or passed in the `--full-refresh` flag.
I'm not sure of the right way to trigger that re-compilation, let alone how to do it from within the Jinja (materialization) context. These are two clearly separated steps today, and dbt expects a model's compiled SQL to be available before it starts doing anything materialization-related.
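To make the recompilation problem concrete, here is a minimal Python sketch (hypothetical, not dbt's real API) that mimics what Jinja compilation of the example model produces under the two values of `is_incremental()`. The `compile_model` function and its SQL strings are illustrative assumptions only:

```python
# Hypothetical sketch: why a schema-change-triggered full refresh needs the
# model's SQL re-compiled with is_incremental() returning false.
def compile_model(is_incremental: bool) -> str:
    """Mimic Jinja rendering of the example incremental model above."""
    sql = "select * from some_model"
    if is_incremental:
        # Incremental runs only pull the trailing 3-day window of data.
        sql += " where date_day > current_date - interval '3 days'"
    return sql

# The SQL dbt has already compiled before the materialization runs:
incremental_sql = compile_model(is_incremental=True)
# The SQL we would actually need once full-refresh mode is triggered:
full_refresh_sql = compile_model(is_incremental=False)

print(incremental_sql)
print(full_refresh_sql)
```

Replacing the table with `incremental_sql` would silently drop all but 3 days of history, which is exactly the failure mode described above.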
There's also a performance consideration: by the time we know there's been a schema change, we've already created the temp table with new data, which now needs to be thrown away in favor of the "full" version. We could try hacking around that by using `get_columns_in_query` (`where false limit 0`) to detect schema changes, instead of actually running and saving the query, but there's always risk involved with that sort of proxy approach.
(cc @colin-rogers-dbt - this could be a fun problem to jam on, but definitely not trivial)
Some more tactical considerations, typed out before I started thinking about the point above:
- Ultimately, we'll want automated testing for this. The existing functional tests for `on_schema_change` behavior in incremental models are defined here: https://github.com/dbt-labs/dbt-core/tree/main/tests/functional/incremental_schema_tests
- In order to support this behavior on all our adapters, we'll need similar changes to the incremental materializations in `dbt-snowflake`, `dbt-bigquery`, and `dbt-spark`. Several of these don't use an "alter-rename-swap" mechanism, but instead just an atomic `create or replace`.
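To illustrate the difference between the two replacement mechanisms mentioned in the last point, here is a hypothetical sketch (the `replace_table` helper and exact DDL are assumptions, not dbt's actual macros):

```python
# Hypothetical sketch of the two table-replacement mechanisms: an atomic
# "create or replace" vs. the multi-step "alter-rename-swap" pattern.
def replace_table(adapter: str, relation: str, tmp: str) -> list[str]:
    if adapter in ("snowflake", "bigquery"):
        # Atomic replacement: a single DDL statement, no intermediate names.
        return [f"create or replace table {relation} as select * from {tmp}"]
    # Alter-rename-swap: stage the new table, then swap names and clean up.
    backup = f"{relation}__backup"
    return [
        f"alter table {relation} rename to {backup}",
        f"alter table {tmp} rename to {relation}",
        f"drop table {backup}",
    ]

for stmt in replace_table("postgres", "my_table", "my_table__tmp"):
    print(stmt)
```

Because the atomic variant never has a "renamed but not yet swapped" state, the full-refresh branch of each adapter's incremental materialization would need its own handling.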
@@ -128,6 +128,9 @@

    {% do exceptions.raise_compiler_error(fail_msg) %}

{% elif on_schema_change == 'full_refresh' %}
    {{ return({"full_refresh": True}) }}
What do you think about some additional debug-level logging here? Something like:
{% do log("Full refreshing " ~ target_relation ~ " on account of schema change") %}
{% if dest_columns is mapping and dest_columns.get("full_refresh") %}
    {% set build_sql = get_create_table_as_sql(False, intermediate_relation, sql) %}
    {% set need_swap = true %}
{% else %}
We'll need logic like this in every `incremental` materialization, including the modified versions in other adapter plugins we maintain.
Any progress on this?
It wasn't picked up on our side, so no. @dbeatty10 do you think you could pick that up as part of the incremental effort?
This will be quite tricky to implement! As described in my comment above, we need to re-compile the model's SQL mid-materialization, now with `is_incremental()` returning `false`.
Oh, by "pick that up" I meant the refinement, not the implementation! So no @nbsmo4, no progress on it sadly :/

Would be amazing if this feature makes it through, won't be able to contribute myself unfortunately.
Thanks for taking the time to open this PR @pol-defont-reaulx! Since opening, we've decoupled dbt Adapters from dbt Core, and this code now lives in a separate repo: dbt-adapters. A consequence of the decoupling is that this PR can't be merged anymore as-is, so we're closing it. For more context, see #9171. The linked issue has already been transferred. Please don't hesitate to re-open your proposed code changes in the dbt-adapters repo.
resolves dbt-labs/dbt-adapters#154
Description
I added a `full_refresh` option to the `on_schema_change` config for incremental models. Now, with this option set, when dbt detects a schema change on a model, it will do a full refresh of that model.
Checklist
- Run `changie new` to create a changelog entry