Manage data warehouse objects (other than tables and views) #3391
Comments
There is definitely a general interest in being able to further simplify at least some level of object management with dbt. I know it would be useful for us in our projects.

One issue I see with schemachange in particular is that it's intentionally not idempotent - in fact, it's deliberately the opposite: it creates a database change table and uses that to manage the deployment of specific objects and changes to the database. This makes sense for managing a transaction-processing database like Oracle, SQL Server, or Postgres, where the state of the database is gigantic - all the schema plus however many rows across all the tables - and it can't just be dropped and recreated from an external reference source, because the database is, itself, the most compact reference source for all of its state.

However, data warehouses are a totally different beast. For the most part (with some notable exceptions if you're using things like dbt snapshots) they are not systems of record that contain the representation of all the state of your business. They're pulling in data from external sources and doing a series of complicated transformations to produce useful output. And the power of dbt is that you can get rid of all that nasty state tracking by literally dropping and recreating all of your models in the database every time - no tracking of which dimension, fact, or transformation table got updated when. All the state is coming in from other systems. Now, for performance reasons, there are compromises like incremental models, and in practice a lot of places with large amounts of data aren't necessarily dropping and recreating everything every day.

Yes, there are friction points with the other objects like users, warehouses, file formats, etc. above, and that's the reason for this ticket. It's just that, in my humble opinion (and I am definitely interested in a good debate on this), non-idempotent solutions set us back and also make things even more confusing when interacting with dbt, which has such a strong pro-idempotency philosophy.
@boxysean Such a good topic! Thanks for opening. Let's operate from the premise that every object we're talking about here can be managed via SQL. I don't think that's true today of Databricks SQL endpoints—these require a separate API request—but I hope that will be true, someday. In general, we like to take the view that if it's in a database/warehouse/query engine, and it's important, it ought to be SQL-addressable. The last year of BigQuery development has certainly vindicated that point of view. So: If an object can be created, altered, and dropped by SQL, then dbt can manage it. The question is: What's the best dbt construct to use? To paraphrase Count Leo, relational objects—tables, views, even materialized views—are all quite alike; every non-relational data warehouse object is strange in its own way. This is a distinction I find productive, because I think about these categories quite differently:
1. Logic-bearing objects

The first category includes relational objects (tables, views, materialized views), but I'd argue it might also include UDFs. (More on that in a second.) The best dbt construct for these is the SQL (or CSV) file: models, snapshots, seeds. I so agree with @codigo-ergo-sum's point that these are the things dbt expects to operate on, and to do so with idempotence, and that this assumption powers so much of dbt's intuition today.

UDFs

Let's talk for a moment about UDFs, since there are a lot of upvotes (but not a lot of comments) over in #136. I've seen two compelling treatments of UDFs in the wild: defining the UDF's logic in a dbt macro, so it can be templated inline into any query that needs it; or creating the UDF as a persistent database object, managed via hooks or run-operations.
Some data warehouses, namely BigQuery, give the option of defining UDFs as temporary (defined in a query, used in that query, gone thereafter) or persistent (database objects). That distinction roughly maps to the two options above. We're not the only ones who have some uncertainty here.

2. Shells, conduits, pointers

In this category I might include things like sources, exposures (?), schemas, Snowflake databases, and compute resources (Snowflake warehouses or Databricks endpoints). These are the constructs which "standard" dbt operations expect to already be in place. In the ideal case, dbt gets to pick up where some sort of Terraform-y process has left off. That setup process (e.g. Permifrost) promises to take a totally new, spotless, empty warehouse/project/account and put in place just the necessary scaffolding (e.g. our recommended Snowflake setup), just enough so that dbt can start building.

As far as dbt is concerned, the best way to define these objects is in YAML, not SQL: there's no business logic, just attributes. dbt gains a lot by knowing about these objects, and referencing them, but it doesn't expect to actually manage them. Yet here's the kicker: if dbt knows about them, it's 100% possible to write dbt macros that will manage them, as hooks and operations. This is how I think about the dbt-external-tables package.

I'd want to take a similar tack for other object types that are more input than output for dbt operations. It seems Snowflake has the majority of them: databases, warehouses, stages, file formats, streams, changes. It's easy enough to write the exact SQL needed to create any of them, wrap it in a macro, and call it from a hook or operation. But that's a lot of boilerplate SQL to write and rewrite. It would feel much better to define reusable macros that take just the essential values as inputs.

So: where to define (or find) those essential values? dbt already knows about some of these (databases, warehouses) via configs on first-order objects. Even then, the rest of these objects would need their attributes defined somewhere. Imagine a spec like:

```yaml
version: 2

asset:
  - name: tsv_handler
    asset_type: file_format
    options:
      type: csv
      compression: auto
      record_delimiter: "\t"
  - name: json_handler
    asset_type: file_format
    options:
      ...
```

An answer that is possible today is to define a macro that exists only to wrap and return static data: pythonic arrays and dictionaries. For instance:

```sql
{% macro get_my_file_formats() %}
    {% do return([
        {
            "name": "tsv_handler",
            "options": {
                "type": "csv",
                "compression": "auto",
                "record_delimiter": "\t"
            }
        },
        {
            "name": "json_handler",
            "options": ...
        },
        ...
    ]) %}
{% endmacro %}
```

For what it's worth, this kind of wrapper macro is a pattern that works today. I think there's plenty we can do to improve that workflow, and it feels to me like the right thread to pull on. We should be opinionated about what dbt should do by default, which is managing logic-bearing objects, while also being cognizant of all the things dbt can do to set up its own scaffolding, comprised of all the other, stranger objects in the database.
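As a rough illustration of the hook/operation approach described above, a companion macro might look something like the sketch below. This is not an official dbt feature: it assumes Snowflake file formats, the `get_my_file_formats()` macro above, and a `create_my_file_formats` macro name invented here for illustration.

```sql
{# Hypothetical sketch: loop over get_my_file_formats() and issue one
   CREATE FILE FORMAT statement per entry via run_query(). #}
{% macro create_my_file_formats() %}
    {% for ff in get_my_file_formats() %}
        {% set ddl %}
            create file format if not exists {{ target.schema }}.{{ ff["name"] }}
            {% for key, value in ff["options"].items() %}
                {{ key }} = '{{ value }}'
            {% endfor %}
        {% endset %}
        {% do run_query(ddl) %}
        {% do log("created file format " ~ ff["name"], info=True) %}
    {% endfor %}
{% endmacro %}
```

This could be invoked ad hoc with `dbt run-operation create_my_file_formats`, or adapted so that a hook returns the DDL for dbt to execute before models build.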
Thank you @jtcohen6 and @codigo-ergo-sum for the thoughtful responses. Yes, from a 10,000-foot view I see that it is easy for me to group the non-table, non-view resources into one big bucket and suggest they all be managed with the same finesse as tables and external tables. Indeed, for each resource we may need to form a nuanced opinion about how to manage its lifecycle, and whether it fits the idempotent quality that dbt exhibits today. I recognize this and appreciate the patience in your explanations.

Zooming in, I am particularly interested in a capability for managing UDFs with dbt. I think dbt users and the community would benefit from a similar treatment for them:

Main reason 1: Many data management professionals have successfully spent their careers writing procedural SQL (e.g., PL/SQL). I think it's safe to say there is an industry-wide re-platforming trend toward cloud data warehouses, and many career data professionals expect that they can continue writing procedural SQL on their new platforms. (And they are able to! Albeit primarily with Javascript, at the time of this comment.)

Main reason 2: People who manage data re-platforming projects AND who are adding dbt to their data platform would appreciate managing UDFs (and similar) with dbt. They are interested in making their team effective, and teams of career data management professionals want to use procedural SQL. (I am recounting my experience talking with such managers here.)

Small reason 3: UDFs can be complementary to the opinionated way that dbt transforms data. For example, BigQuery and Snowflake Javascript UDFs allow programmers to use Javascript to work with JSON naturally.

Small reason 4: UDFs can be shared at the database level, so that dbt programmers and non-dbt programmers can share business logic.

I must admit both of my main reasons don't quite feel "good", as I personally think data management professionals should use dbt for most/all data transformation for general-purpose analysis use cases. However, part of me wonders if adding UDF management could help the adoption of dbt in hesitant organizations and/or hesitant individuals: essentially, dbt penetration increases and more people come to appreciate how dbt's opinionated data modeling process could benefit them. Thanks for laying out the story above.
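For concreteness, here is a minimal sketch of what UDF management via a macro could look like today, assuming Snowflake, a hypothetical `create_my_udfs` macro, and a toy Javascript UDF named `json_value`; it is exactly the kind of boilerplate this issue asks dbt to absorb, not an existing dbt feature.

```sql
{# Hypothetical sketch: (re)create a Snowflake Javascript UDF so that models
   and ad hoc queries can share the same JSON-handling logic. #}
{% macro create_my_udfs() %}
    {% set ddl %}
        create or replace function {{ target.schema }}.json_value(j varchar, k varchar)
        returns variant
        language javascript
        as
        $$
            // Javascript UDF arguments are referenced in uppercase
            return JSON.parse(J)[K];
        $$
    {% endset %}
    {% do run_query(ddl) %}
{% endmacro %}
```

Run via `dbt run-operation create_my_udfs`, this keeps the function definition version-controlled alongside the models that use it.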
This is probably a bit tangential to the original point of the topic, but @boxysean - have you tested how performant UDFs are at scale? Or has anyone else that you're aware of? I'm thinking of the case where you're calling a UDF inline in a SQL query that, say, does some complicated string processing or a numerical calculation, and the query scans a billion-row underlying table and needs to call the UDF once for each incoming row - so a billion times. Is Snowflake then having to perform some sort of internal transition, a billion times, between the execution context of the SQL statement and the execution context of the UDF? In other words, is it similar to embedding a scalar subquery in the projection/SELECT portion of a SQL statement, where the scalar subquery needs to be executed once per row, which at scale is not performant? This is essentially moving back from set-based processing to row-based processing, whereas in general I've found that performant SQL and database usage requires the opposite: moving from row-based logic to set-based logic.

I'm asking because the vision you're describing above is one where UDFs allow elegant encapsulation of logic and are therefore, in some ways, superior to embedding logic directly in SQL statements - which in turn makes the case that dbt should have better native handling of UDFs as first-class dbt objects. But if UDFs are not performant at scale, they perhaps become less attractive in general. Don't get me wrong - it doesn't mean we shouldn't continue having this discussion; it just means that UDFs might not be so useful in a wide range of scenarios. This is probably something that could be verified by testing, I admit. I'm a bit swamped right now so I don't have time to look at it, but it would be interesting to see what Snowflake or BQ do if you call a UDF, say, a billion times in one SQL query.
@codigo-ergo-sum I have also not tested the performance of UDFs at scale, and I agree that UDFs could be significantly inferior to plain ol' SQL in many/most cases. Thank you for highlighting some of the tension in the "managing UDFs" case as I've written it out. Reflecting back, I also do not find my previous comment a compelling argument for prioritizing my issue here. The audience who would be happy to see this feature is not you and I; rather, it's folks who have not used dbt. Perhaps there's a better class of data warehouse objects to use as an example in this story - a class of objects that you, I, and folks who have not used dbt could all be excited about? Snowflake Streams, perhaps?
I feel there is an excellent case to be made for including management of row access policies (Snowflake, in our case) and associating them with models. dbt already creates schemas, and it would be useful to also create databases (and roles). There is a need for some metadata management to handle deltas in these objects, so that unused objects are removed, and it would be good to know whether dbt will manage schemas and databases or whether that should be left to a third-party tool like Terraform.
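To make the row-access-policy case concrete, here is a rough sketch of how this can be approximated with macros and hooks today. It assumes Snowflake, plus a `security` schema and `sales_region_policy` / `create_sales_region_policy` names invented purely for illustration.

```sql
{# Hypothetical sketch: create a row access policy once (e.g. via run-operation),
   then attach it to a model from a post-hook after each rebuild. #}
{% macro create_sales_region_policy() %}
    {% set ddl %}
        create row access policy if not exists security.sales_region_policy
        as (region varchar) returns boolean ->
            'SALES_EXECUTIVE_ROLE' = current_role()
    {% endset %}
    {% do run_query(ddl) %}
{% endmacro %}
```

A model could then attach the policy with a post-hook such as `alter table {{ this }} add row access policy security.sales_region_policy on (region)`. Since `dbt run` typically rebuilds the table, the policy is re-attached to a fresh object each run; incremental models would likely need an extra check so the policy isn't added twice.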
@lyschoening and I have been discussing how to manage complex Snowflake Users and Roles. Scenario: a security-conscious organization wants to tightly control Snowflake User access to data on an individual basis. Because Snowflake's model has each query executed under a specified Role (docs), access ultimately has to be granted to the Roles a User can assume. However, a limitation of the Snowflake OAuth integration is that authorization occurs for a single role, which does not allow switching roles within a database session (e.g., in dbt Cloud, docs). One solution could be a 1:1 user-to-role mapping (e.g., a dedicated role per user). Such an organization could benefit from managing these roles within dbt itself! Alternatives to this are Terraform management or Permifrost management.
@boxysean This is really helpful! This is not the first time I've heard about the one-role-per-user pattern, and I'd like to develop a clear set of recommendations for using it with dbt + Snowflake.
This makes sense to me; it's very much in line with patterns we already use elsewhere. The OAuth limitation is a pretty serious one. It's worth looking more into External OAuth, which doesn't have this limitation. External OAuth has come up before, in other contexts, as a preferable alternative to the OAuth implementation currently built into Snowflake.
Agreed. The naive version of this in dbt might look like implementing reusable "role groups": reusable vars or macros that simply store a list of user-roles. When each model runs, it grants access to the appropriate group of user-roles. The better version might be the orchestration tool (dbt Cloud) providing those "groups" as env vars that it sets using information about the users it knows about, and the permissions/license type each one has, in a given account/project. I'm sure there are many limitations to this approach (it by no means represents a full permissioning system for the database), but something like that might get us 80% of what's needed, without spinning up a separate Terraform/Permifrost deployment. (@saimaddali I'd be curious to hear what you think about this, since we were discussing it the other week.)
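A minimal sketch of the "naive version" described above, assuming Snowflake and a project var holding the user-role list; the macro name `grant_to_role_group` and var name `finance_user_roles` are invented here for illustration.

```sql
{# Hypothetical sketch: read a "role group" from a project var and grant each
   user-role access to a schema. Run as:
   dbt run-operation grant_to_role_group --args '{schema: analytics, role_group_var: finance_user_roles}' #}
{% macro grant_to_role_group(schema, role_group_var) %}
    {% for role in var(role_group_var, []) %}
        {% do run_query("grant usage on schema " ~ schema ~ " to role " ~ role) %}
        {% do run_query("grant select on all tables in schema " ~ schema ~ " to role " ~ role) %}
        {% do log("granted select on " ~ schema ~ " to role " ~ role, info=True) %}
    {% endfor %}
{% endmacro %}
```

The role list could live in `dbt_project.yml` (e.g. `vars: {finance_user_roles: ['alice_role', 'bob_role']}`), the same loop could be moved into a model post-hook that grants on `{{ this }}` instead of a whole schema, and the list itself could eventually come from an env var set by dbt Cloud, as suggested above.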
Describe the feature
Add support for managing non-table, non-view data warehouse objects.
Objects that I am thinking of (non-exhaustive):
Related: dbt-external-tables (masterfully) manages external source objects from several data warehouses, outside of what dbt run manages.

Describe alternatives you've considered
This Slack community thread has discussion of ways community members are currently managing their data warehouse objects (primarily Terraform for Snowflake, and some interest in Permifrost).
This post from Jeremiah Hansen suggests using schemachange for Snowflake database migrations.
Additional context
Terraform providers:
Who will this benefit?
Folks managing data warehouses.
Are you interested in contributing this feature?
Yes :-)