Support temp tables in Snowflake "table" materializations #2725
Comments
Hey @gil-walzer-zocdoc, what's your use case for temp tables as a materialization? dbt opens several different Snowflake connections (sessions) within a given run, even more so when running with multiple threads. Since temp tables are dropped at the end of each session, they don't stick around for downstream models to select from. We do support a […] If you don't want those models sticking around as tables between runs (to avoid storage costs, I'm guessing?), I'd advise: […]
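(Editorial aside, not part of the original comment: dbt's built-in way to keep intermediate logic from persisting as a table at all is the ephemeral materialization, which is inlined as a CTE into whatever selects from it. A minimal illustration, with a hypothetical model name:)

```sql
-- models/int_orders.sql (hypothetical model; compiled into downstream queries as a CTE)
{{ config(materialized='ephemeral') }}

select
    order_id,
    amount
from {{ ref('stg_orders') }}
```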
Ah ok, I didn't realize dbt actually opens up several Snowflake connections; that certainly dashes my hopes. I described our use case in the description: we're operating on a large dataset, but the domain is still pretty new to our business, so our transforms are still going through a lot of logic changes. We wanted to store intermediate transforms as temp tables, not CTEs, but we were also running into issues where the transient table would be created with a set of columns that wouldn't match the logic in a subsequent run. I should mention that we're leveraging a Snowflake database clone to run these transforms on while we develop. I came up with a custom method to set up clones and point Snowflake towards them via the […]
Right on, that makes sense.
I don't think I understand this piece. In subsequent runs, does the transient intermediate table encounter a mismatch with data changes in upstream sources, or with logic changes in downstream models? Wouldn't you be rerunning the intermediate model (and recreating the transient table) if it's within the modified section of the DAG?
Zero-copy cloning is a powerful Snowflake feature. Since it's available as DDL statements, we've found it plays nicely with existing dbt constructs (hooks and operations), and we haven't felt the need to wrap it more natively in an adapter method or materialization. Claire has a great Discourse post about quickly spinning up a dev environment via zero-copy clones.
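(An editorial sketch, not from the comment itself: in dbt, a zero-copy clone run as an operation typically boils down to a small macro invoked with `dbt run-operation`. The macro and database names below are hypothetical.)

```sql
-- macros/clone_dev_db.sql (hypothetical macro and database names)
{% macro clone_dev_db(source_db='ANALYTICS', target_db='ANALYTICS_DEV') %}
    {% set ddl %}
        create or replace database {{ target_db }} clone {{ source_db }}
    {% endset %}
    {# run_query executes the DDL on the current Snowflake connection #}
    {% do run_query(ddl) %}
    {% do log("Cloned " ~ source_db ~ " into " ~ target_db, info=True) %}
{% endmacro %}
```

Invoked with something like `dbt run-operation clone_dev_db --args '{target_db: ANALYTICS_DEV}'`.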
The issue was actually just for a single model which had a data change between two runs, not for data changes between models. We were using an […]
What I was seeing was this (and remember, this is only about a single model): on the first run of our DAG (on a fresh database), dbt created transient tables from the columns specified in the model. On a subsequent run, the output columns in the model changed, and the materialization prepared a new temp table (suffixed with _dbt_tmp) with the correct columns, but did not replace the transient table before attempting to insert. If you believe we're misusing the incremental materialization and would be better served with an alternative, please let us know.
Got it! I don't think you're misusing the incremental materialization, though you should be aware that the inability to capture column additions/deletions/changes is a significant limitation. There's a long-lived issue with proposed remedies: #1132. Generally, models that frequently update their columns are poor candidates for incremental builds. I'm going to close this issue in favor of the bigger-picture discussion, since there isn't a specific code change we'll be making to support temp tables as a materialization.
Sounds good. Thanks for the discussion.
Hi @jtcohen6, we at Zocdoc have started looking into supporting concurrent dbt runs without threading (as separate processes). I wanted to double-check that your point above is still accurate: that dbt's Snowflake adapter opens and closes sessions between models. Is that still true?
Yes, that's still true. This rough mapping holds: one model = one materialization = one Snowflake session/connection (possibly more if additional queries are run before/after).
Hey, @jtcohen6. We're migrating from traditional SQL transformations. We used temp tables in those because the datasets were large, and I'd love to stick with that approach instead of using CTEs, so I'm thinking transient tables. I also find temp tables more readable, easier to debug, more modular, etc. The downside with anything but CTEs in dbt is naming: we have 100+ transformations that run every hour, and it's likely that two separate transformations would each create a "sales" temp table that doesn't use the same logic. Since temp tables are scoped to their session, we didn't have to worry about the temp table names of one transformation conflicting with those of another. I was hoping temp tables in dbt would do the trick, but it sounds like that's out. Any recommendations on how to avoid naming conflicts without an unreasonable name for every temp table (like […])?
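(Editorial aside, not from the thread: one pattern that can keep short names without collisions is to give each pipeline its own schema via dbt's `schema` and `alias` configs, so two pipelines can each have their own `sales` intermediate. The folder and schema names below are hypothetical.)

```sql
-- models/orders_pipeline/sales.sql (hypothetical)
-- With dbt's default schema naming this builds <target_schema>_orders_pipeline.sales,
-- so a "sales" model in another pipeline's schema never collides with this one.
{{ config(materialized='table', schema='orders_pipeline', alias='sales') }}

select
    order_id,
    amount
from {{ ref('stg_orders') }}
```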
I personally would love to see support for temporary tables, along with the ability to run the model in a single thread to avoid the issues with the multiple sessions dbt creates.
Describe the feature
Please allow us to specify a "temporary" option in the config that will create a temp table materialization.
https://github.com/fishtown-analytics/dbt/blob/dev/marian-anderson/plugins/snowflake/dbt/include/snowflake/macros/materializations/table.sql#L26
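A rough sketch of what the proposed opt-in could look like from the model side ("temporary" here is the proposed config key, not something dbt's table materialization currently accepts):

```sql
-- models/int_sales.sql (hypothetical model; "temporary" is the proposed config)
{{ config(materialized='table', temporary=true) }}

select
    order_id,
    sum(amount) as total_amount
from {{ ref('stg_orders') }}
group by 1
```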
Describe alternatives you've considered
I could create a custom materialization that does what is described (a rough sketch follows below), but the change seems sufficiently small and useful to go into dbt's built-in materialization.
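(A minimal sketch of such a custom materialization, assuming dbt's `create_table_as` macro is called with its `temporary` flag set to true; untested, and it omits the cleanup, grants, and docs handling the built-in table materialization performs.)

```sql
-- macros/materializations/temp_table.sql (hypothetical sketch, not tested)
{% materialization temp_table, adapter='snowflake' %}
  {%- set target_relation = this.incorporate(type='table') -%}

  {{ run_hooks(pre_hooks) }}

  {% call statement('main') -%}
    {# temporary=true makes create_table_as emit CREATE TEMPORARY TABLE #}
    {{ create_table_as(true, target_relation, sql) }}
  {%- endcall %}

  {{ run_hooks(post_hooks) }}

  {{ return({'relations': [target_relation]}) }}
{% endmaterialization %}
```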
Additional context
Snowflake offers two alternatives to permanent tables: temporary tables and transient tables. Temp tables are dropped at the end of the session, while transient tables must be explicitly dropped, otherwise they continue to incur storage charges. dbt's `table` materialization currently forces users onto transient tables, and this should be opt-out.
Who will this benefit?
Users that need temp tables for their projects, but don't want those tables to persist between runs, would benefit from having those tables be temporary. Projects that have frequent logic changes to their intermediate transforms as well as large datasets fit this description.
Are you interested in contributing this feature?
I would be willing to make this change