# [CT-1487] [Feature] Allow DBT Models to Reference Created Temporary Tables prior to final result set & materialization #6234
### Comments
@KeeonTabrizi thanks for a thorough write-up and links to tangential issues and PRs! We are always interested to discuss opportunities to improve debugging / speed of query development. Question: is this intended for production use, or just for iterative development?
@dbeatty10 thanks for the comment. It's definitely intended for production use. The iterative development cycle can yield something that is production-ready through these temporary objects, and it's helpful to have that as an option. The alternative is to force translation of these objects back into CTEs, adding more overhead to the process and ultimately lengthening the time to get something out the door when it was otherwise ready to ship.
Gotcha @KeeonTabrizi. What are you trying to optimize for in the iterative development cycle?
### Options

As you're probably well-versed, there's a variety of options for developing dbt models, including table, view, and ephemeral materializations as well as explicit CTEs.

Let's explore what each of those is optimized for.

We can see in the chart above that A) and B) are generally opposites, as are C) and D). 'Tis the nature of trade-offs!

### Table materialization

Optimized for B) and C).

💡 But we can unlock D) by mimicking the way Snowflake temporary tables behave in practice! You'd just add a `post_hook` that drops the table (shown commented out in the example below).

The lifespan of these "temp" tables would be limited to a dbt "session" (spanning Snowflake connections) rather than just a single Snowflake connection session.

### Ephemeral and view materializations

Optimized for B) and D).

### Explicit CTE

Optimized for A) and D).

💡 Although they are not optimized for C), adding a …

### Example

"Temp" tables within a dbt "session":
```sql
-- models/my_data.sql
{{ config(
    materialized="table"
) }}

select ...
from {{ ref('my_other_dbt_model') }}
```
```sql
-- models/your_table.sql
{{ config(
    materialized="table"
) }}

{# uncomment this second config call to drop the "temp" table once this model has built:
{{ config(post_hook="drop table if exists " ~ ref('my_data')) }}
#}

select ...
from {{ ref('my_data') }}
```
☝️ Uncomment the `post_hook` above once you want `my_data` to act like a "temp" table in production. Leaving it commented will allow `my_data` to persist across sessions during development.

### Usage

Rebuilding just the downstream table (leaving the "temp" table as-is):

`dbt build -s your_table`

Rebuilding both the "temp" table and its downstream consumer:

`dbt build -s my_data your_table`
@dbeatty10 many thanks for the detailed response. I will make sure to review it in detail and respond. The week is busy with holidays, but please give me a bit and I will make sure to provide responses.
@dbeatty10 hope my responses help. I'd also be happy to set up a quick 15-minute chat if that helps.
With regard to the scenarios you described above: consider a situation where one needs to build a model that likely requires drafting a complex query (many dependencies, many transformations) with a high computation cost / run time. Let's also add the real-world constraint that time is limited and it's not feasible to build the most streamlined DBT pipeline for the model.

Oftentimes, when faced with such a task, it is unknown what the ideal composite pieces of data are to achieve the end result. It takes a series of trial and error and multiple iterations to understand how to break the data up into logical components/transformations. Temporary tables (instead of CTEs) allow one to keep data hot/available within a session so you can iterate on the query quickly. The end result is really no different than using CTEs, and most complex queries could also start with a series of CTEs; DBT has no issues with this. I don't think this is a question of best practices in analytics engineering; it's just about supporting native functions of a database in the generation of data.

In some databases like Redshift, if your sort keys etc. are not well matched to how the data is used, CTEs can cause major slowdowns, while simply declaring temp tables equivalent to the CTEs and asking Redshift to ANALYZE the temporary objects before downstream use can well outperform the equivalent CTE-based query.
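As a minimal sketch of that Redshift pattern (the table name `my_other_table` and the elided column lists are placeholders):

```sql
-- materialize what would have been a CTE as a session-scoped temp table
create temporary table my_data as
select ...
from my_other_table;

-- refresh planner statistics so downstream queries get good plans
analyze my_data;

-- downstream query reads the (now analyzed) temp table
select ...
from my_data;
```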
Although not at all what we would recommend, the thing you described is actually possible via pre-hooks. In your original description, you already explained how this doesn't work as-is:

```sql
{{ config(
    materialized="table"
) }}

CREATE TEMPORARY TABLE my_data AS
SELECT ...
FROM {{ ref('my_other_dbt_model') }}
;

SELECT ...
FROM my_data
```

But if you just add a `pre_hook`, it works:
```sql
-- ⚠️ Using pre-hooks in this way is an anti-pattern! ⚠️
{{ config(
    materialized="table",
    pre_hook="
        CREATE TEMPORARY TABLE my_data AS
        SELECT ...
        FROM {{ ref('my_other_dbt_model') }}"
) }}

SELECT ...
-- depends_on: {{ ref('my_other_dbt_model') }}
FROM my_data
```

See dbt's docs on forcing dependencies for an explanation of why the dependency might need to be forced via `-- depends_on:`.

### Smaller data sets

For scenarios with a lot of trial & error iteration on a large data set, we'd recommend limiting the data during development, e.g. with a filter or `limit` applied only in dev (see the sketch at the end of this comment).

### Hot data sets

But what about the goal of keeping a data set hot/available so you can iterate on a transformation query quickly? A regular … But let's consider using a … The only explanation I can think of: …
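As a sketch of the dev-only limiting mentioned above (the `target.name` check is one common pattern; the threshold is arbitrary):

```sql
select ...
from {{ ref('my_other_dbt_model') }}
{% if target.name == 'dev' %}
limit 1000  -- keep iteration cheap during development
{% endif %}
```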
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue, or else it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.
Oh no, closed! At the end of the day, I believe leaving this unsupported is a bad outcome. If one can build a model with a CTE, one should be able to build it with a temp table 1:1. If you need to get a model out the door and you are spending too much time waiting for results from your CTE, you shouldn't be forced to break up the model early in its development into materialized precursors, nor should you be forced to convert your temp-table work at the very end because temp tables are not supported. There are real business constraints and timelines that the lack of this flexibility can penalize.

I understand your argument with respect to best practices, but ultimately this is simply a lack of support for a normal database mechanism in the name of a best practice, so I have a fundamental disagreement. I understand there's a whole other can of technical worms with respect to sessions/threading that gets introduced, but that is a different conversation.
Would love this feature as well! :) |
@dbeatty10 can we open this back up?
@dongchris @KeeonTabrizi I think @dbeatty10 did a nice job above laying out the rationale for why dbt both cannot and does not support this, for reasons of technical limitation (how dbt manages database connections) and best practice (code & workflow in dev should closely mirror the code that will actually be running in prod), respectively. In the meantime, if you really want this support, I'd recommend you try out Doug's suggestion above, where you pass your "CTE" into a `pre_hook`.
Thanks @jtcohen6 for the explanation. If we use the pre-hook approach, what if there are actual pre-hooks already running? Would having it in the config this way override existing ones?
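For concreteness (hook bodies and `upstream_table` are placeholders), a model-level config can already carry multiple pre-hooks as a list, so the question is how hooks defined this way interact with hooks configured elsewhere, e.g. in `dbt_project.yml`:

```sql
{{ config(
    materialized="table",
    pre_hook=[
        "create temporary table my_data as select ... from upstream_table",
        "analyze my_data"
    ]
) }}

select ...
from my_data
```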
@dbeatty10 did indeed do a great job RE: rationale, but I don't believe it really was about the technical reasons it is not supported. Perhaps those were implied, but the discussion and rationale were squarely on the analytics-engineering side.
### Is this your first time submitting a feature request?

### Describe the feature
This feature request is to support using temporary tables (within a single session) instead of CTE references in the execution of a single DBT model.
To be clear, as I've seen this confusion: this is not a request to materialize a temporary object, but rather to utilize a temporary object in the materialization itself.
I believe it would require a model-specific override to limit execution to within 1 thread/session (vs. the global thread default in `dbt_project.yml`). Additionally, the compilation/execution order would first require the execution of the temporary objects, followed by the remaining query (which could actually include a CTE). For the examples below I will assume a Snowflake DB.

Currently, DBT can compile a model like the one below:
`dbt_model.sql`:
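A representative sketch (column lists elided; the model reuses the names from the pre-hook example earlier in the thread):

```sql
with my_data as (
    select ...
    from {{ ref('my_other_dbt_model') }}
)

select ...
from my_data
```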
The `run` target may look something like:
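A rough sketch of the compiled output (the `analytics` schema and exact DDL are assumptions; the real statement varies by adapter and version):

```sql
create or replace transient table analytics.dbt_model as (
    with my_data as (
        select ...
        from analytics.my_other_dbt_model
    )
    select ...
    from my_data
);
```

A simple version of this query that would utilize a temporary table could look as follows: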
`dbt_model_temp_table.sql`:
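This mirrors the snippet quoted in the discussion above:

```sql
create temporary table my_data as
select ...
from {{ ref('my_other_dbt_model') }}
;

select ...
from my_data
```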
The `run` target will currently generate invalid SQL as follows:
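Roughly (again assuming the `analytics` schema), dbt would wrap the entire model body in a single create statement, which is invalid because the DDL cannot be nested inside it:

```sql
create or replace transient table analytics.dbt_model_temp_table as (
    create temporary table my_data as  -- invalid: nested DDL
    select ...
    from analytics.my_other_dbt_model;

    select ...
    from my_data
);
```

The compiled version should first execute any create statements and then wrap the final query which generates the result set to materialize:

```sql
create temporary table my_data as
select ...
from analytics.my_other_dbt_model;

create or replace transient table analytics.dbt_model_temp_table as (
    select ...
    from my_data
);
```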
What I believe is needed is an additional config argument on the model to indicate that temporary tables are used, which would force a single session/thread for the model's execution as well as re-order the execution of the model (first executing the temporary objects in the session, and finally generating the results for DBT to materialize).
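A hypothetical sketch of what such a config could look like (the `single_session` flag is invented for illustration; it is not an existing dbt config):

```sql
{{ config(
    materialized="table",
    single_session=true  -- hypothetical: pin all statements to one connection/session
) }}

create temporary table my_data as
select ...
from {{ ref('my_other_dbt_model') }};

select ...
from my_data
```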
### Describe alternatives you've considered
Converting everything to CTEs or creating individual upstream models is the only option. However, I feel strongly enough about the development/iteration cycles around complex queries/transformations where temporary tables are used that having the option to directly materialize those queries through DBT models would be a win for developers.
### Who will this benefit?
This will benefit anyone developing complex queries, as well as those already using temporary tables, which can improve debugging / speed of query development since results remain available within an open session.

It allows a developer to move a query > model > production quicker, and provides that option versus being forced to convert a query into CTEs and/or break up a query into component models with dependencies.

While this may be theoretically better data engineering practice, it shouldn't be forced by the lack of support for temporary tables.
See this comment which also describes some of the benefit: #2725 (comment)
### Are you interested in contributing this feature?
Happy to ideate
### Anything else?
Tangential but not directly related issues & PRs: