Adapter Relation/Materialization on Rails #8177
Replies: 2 comments 1 reply
Quick thoughts:
I'm not going to lie - a lot of this went over my head. If I boil this down to an ELI5: it sounds like we are aiming to decouple what is a relation vs. a materialization, which we use a bit interchangeably right now. To create a new materialization, the main change to the user experience (and the partner adapter creation experience) is that they would have to write two macros: one that calls the relation and one that contains the create statement. They could also reuse the templates created in python? Overall, from what I can gather, this feels quite reasonable for scaling out, but it would be a major breaking change for both users (we have a lot of custom materializations being used in production today) and vendor-supported adapters (let alone the community ones). It feels like an item for a dbt 2.0 list.
Problem Statement
The Adapters team is looking at updating the way relations and materializations are managed within the adapter repos and `dbt-core` itself. This will require a more mature data model than we presently offer. The goal is to make an adapter easier to maintain and its functionality easier to extend. Our current approach is two-fold. We have a `BaseRelation` class that gets subclassed once (usually) per adapter. And we have materializations whose logic resides entirely in a jinja template. This was sufficient for our initial materializations; they were generally one-to-one with their relations. Examples include `table`, `view`, and `seed` (for the most part). However, we have added more complicated materializations since then, and re-used many existing relations. This organic growth led to quick delivery, but grew cumbersome to extend further.
Solution
Relation vs. Materialization
We will start by drawing a hard separation between the concept of a `Relation` and the concept of a `Materialization`. These two concepts are often conflated, so I will provide a definition for use within the context of this discussion:
Relation
A `Relation` is a database object that could present data in some way. Examples include `table`, `view`, `materialized_view`, and `dynamic_table`. The concept of a relation is very much on the "what" side of things. A relation does not care how its data is created, updated, etc.; it only cares about how its structure is created and updated. As an example, a `table` is a `table` whether it's always updated via `drop`/`create` (`table` materialization) or whether it's loaded with updates only (`incremental` materialization).
Materialization
A `Materialization` is a strategy that is applied to one or more `Relation` objects with the objective of updating the data in those relations. A `table` materialization is the strategy of using `drop`/`create` to update data in a `table` relation. A `seed` materialization is the strategy of using a csv to upload data into a `table` relation. Materializations often need multiple relations in order to be executed (e.g. target, intermediate, backup). And it's conceivable that there is a materialization where "the" target relation is actually multiple relations. A home-grown implementation of materialized views using a table, a view, and a stored procedure is such an example.
Jinja vs. Python
There is no concept of a `Materialization` within `dbt-core` other than what is found in the jinja materializations. There are similar concepts, like `CompiledNode` or `ParsedNode`, but these serve many purposes and are much bigger than the concept of a materialization. Inevitably we parse these objects, which are available in the global jinja context, right in the jinja template. This puts all of the orchestration logic in jinja as well, which makes it very difficult to test. Jinja is very good at templating; Python is very good at articulating logic flow. We should use these tools for what they are meant for.
Materialization vs. CompiledNode
In general, a `ParsedNode` or a `CompiledNode` contains all of the information needed to produce a `Materialization`, but in a format that can be difficult to parse. We need to serve multiple implementations of various types of relations across many database platforms; a single class becomes unwieldy to maintain. A node should be allowed to concern itself with where it sits in the graph and how it interacts with other nodes without the burden of also determining how to implement its materialization and update its relations (and vice versa).
Implementation
Use `BaseAdapter` as a service layer
`BaseAdapter` has gotten very wide; it has a lot of methods on it that are just dispatches to components. `BaseAdapter` should really be a service layer for the jinja context. Something in the template asks `BaseAdapter` for an object, and `BaseAdapter` goes and gets it by combining its factories. That object should then have all of the data it needs to do its job. The idea is very basic, but the point is that the logic lives in python, and the results of that logic are available as properties that are easily accessed in jinja.
Distinguish between `Relation` and `RelationComponent`
A `Relation` corresponds roughly to a dbt model and represents a way to access a single dataset in a database. That dataset could be physical, e.g. a `table`, or virtual, e.g. a `view`. A `Relation` has a single "primary" database object that itself contains a collection of smaller database objects. For example, a `table` in Postgres has the following components: schema, database, columns, indexes, grants. Without the original table, these components have nothing to be tied to. This latter set of objects will be referred to as `RelationComponent`. We are effectively making `Relation` the top-level database object; everything else becomes a `RelationComponent`. We are taking a bit of liberty with components like `schema` and `database`: we are saying these are attributes of `Relation`, even though they feel like higher-level collections of `Relation` objects. This reflects the fact that we are focusing at the `Relation` level for all implementation, even if we only need to alter one component. A reflection of this sentiment appears in the dbt enum `RelationType`. The values in this enum all represent `Relation` objects and not `RelationComponent` objects; `table` and `view` are included, `schema` and `index` are not.
Focusing at the `Relation` level requires some translation between the database vernacular and the dbt vernacular. We can use Postgres indexes as an example. Once a table is created in Postgres, one can create an index independently by indicating the table to which it belongs. In that sense, the index is independent and is the focus; hence it's a `create` statement on an `index` in Postgres vernacular. However, if we shift to a `Relation`-focused mindset, this is really altering the table to add an index; hence it's an `alter` statement on a `table` in dbt vernacular.
Takeaways:
- `table`, `view`, `materialized_view`, and `cte` are all examples of a `Relation`
- `schema`, `database`, `index`, and `grant` are all examples of a `RelationComponent`
- `Relation` instances have a `RelationType`
Organize Jinja Footprint
Ultimately, Jinja does not care how macros are organized in files and directories; but humans do care. We should organize macros by their use and by whether or not there is an expectation for adapter maintainers to overwrite them. We boast about having a very configurable tool, but we don't always have an easy way to configure one piece of it. "Do you want to configure how to drop your relation? Overwrite `drop_relation`. Do you want to add one new relation that doesn't quite fit the existing `drop_relation` implementation? Still overwrite `drop_relation`. Do you need to drop a relation as part of your materialization? You can't use `drop_relation`, so you'll likely just write the drop statement explicitly." This should be a little more flexible, with more entrypoints.
Distinguish between macros that template and macros that execute
In many dbt macros, we articulate how to do something, say `drop table if exists my_table`, and then also do that thing by wrapping the statement in a `call` block. The problem is that the code then isn't reusable in many situations. For example, when building the sql to execute a materialization, it's often necessary to drop something: either an existing relation or a backup. However, since `drop_relation` has a `call` in it, and the materialization will also need to be `call`ed, `drop_relation` needs to be implemented twice. Instead, we should create sql in templates that can be reused, and leave the `call` statement as the last step. In addition to this distinction, it's worth noting that dbt now templates both sql and python due to python models, so the `_sql` suffix is no longer appropriate. An alternative is to create a suffix per language, which would make `_sql` relevant again; however, that would require some level of indirection to know that `_sql` means the macro doesn't actually do the thing, it just provides how to do the thing. `_template` is much more straightforward.
Apply a tiered structure for relation macros
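A minimal jinja sketch of the template-vs-execute split described above (macro names are illustrative; `statement` is dbt's built-in call block):

```jinja
{#- The template macro only renders sql; it executes nothing. -#}
{% macro drop_relation_template(relation) %}
    drop {{ relation.type }} if exists {{ relation }}
{% endmacro %}

{#- The executing macro is a thin wrapper; the `call` happens last. -#}
{% macro drop_relation(relation) %}
    {% call statement('drop_relation') %}
        {{ drop_relation_template(relation) }}
    {% endcall %}
{% endmacro %}
```

A materialization that needs to drop a backup can now reuse `drop_relation_template` inside its own `call`, instead of re-implementing the drop statement.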
The jinja macro structure should be designed to be extensible, not replaceable. If the only entrypoint is the top level (i.e. the database function is a single macro), then the entire database function needs to be overwritten. This means that the overwritten macro also needs to be maintained in parallel with the default in `dbt-core`. If the macro is instead broken up into a few pieces, it's possible to extend that database function without overwriting the pieces that are already implemented in `dbt-core`. Let's use `create` as an example:
`create_template(relation)`
This macro can accept any relation and should work in any adapter; said another way, it's both relation-agnostic and adapter-agnostic. Barring edge cases and specific circumstances, this should be the macro that's called to create a relation, even if the relation type is known. Its primary function is to dispatch to the relation-specific, adapter-agnostic macro. There is no sql in here; it's mostly an if/else block, along with anything that we want to do in all `create` scenarios (e.g. logging).
This macro has a default implementation: `default__create_template(relation)`. The only reason to overwrite it is if the adapter supports more relation types than `dbt-core` does (e.g. Snowflake implements dynamic tables). In that case, it should be overwritten to check for that specific case, and otherwise call `default__create_template(relation)`, much like an abstract method. This represents the first entrypoint: `dbt-core` can be extended to add a relation type that is adapter-specific while still using the existing `dbt-core` workflows.
`create_view_template(relation)`
This macro is relation-aware, or relation-specific, but it's still adapter-agnostic. Its primary function is to be overridden by the adapter, or to throw an exception in the event that the adapter has not implemented this relation type (in this case `view`, so unlikely).
This macro's default implementation, `default__create_view_template(relation)`, should only raise an exception that communicates that `create` has not been implemented for `view` relations on this adapter. It's like raising `NotImplementedError` for an abstract method, but for jinja. The adapter is not forced to implement it, but if it calls it, this will remind the maintainer to either implement it or address the macro that called it.
The adapter-specific implementation that overrides this macro should contain the sql that creates the relation, in this case a `view`. There should be no consideration of materializations or use-case-specific implementations; that should happen in the python component. Instead, this macro should look very similar to the syntax provided in the database's docs. As such, it should not need much maintenance, unless the database adds new features to this relation or `dbt-core`/`dbt-<adapter>` decides to implement more options on this relation.
`create_dynamic_table_template(relation)` (an adapter-specific relation)
`dbt-core` can now be extended to support an adapter-specific relation without overwriting the existing relations. The maintainer would need to provide two macros within the adapter. Let's use a dynamic table as an example; it's a new relation in Snowflake. Here's how `create` can be implemented for a dynamic table:
Write composite operations in terms of relation-agnostic macros
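Picking up the dynamic table case from the preceding section, the two adapter-side macros might look roughly like this (the config attributes `target_lag`, `warehouse`, and `query` are illustrative, not an established interface):

```jinja
{#- Adapter-specific entrypoint: handles the new type, then defers. -#}
{% macro snowflake__create_template(relation) %}
    {% if relation.type == 'dynamic_table' %}
        {{ create_dynamic_table_template(relation) }}
    {% else %}
        {{ default__create_template(relation) }}
    {% endif %}
{% endmacro %}

{#- Adapter-specific relation macro; mirrors Snowflake's documented syntax. -#}
{% macro create_dynamic_table_template(relation) %}
    create dynamic table {{ relation }}
        target_lag = '{{ relation.target_lag }}'
        warehouse = {{ relation.warehouse }}
        as {{ relation.query }}
{% endmacro %}
```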
There are several operations that dbt performs which are composites of several database operations. The most obvious example is a materialization, which combines `create`, `drop`, `rename`, `alter`, etc. in order to execute. However, there are simpler, but still commonly used, operations that are composites of these atomic database operations. A very common example is `replace`. There is no general `replace` database operation; it's a combination of `create`, `drop`, and potentially `rename` for staging and backing up. This workflow can be described independent of relation type. If the macro is written at that relation-agnostic level, in particular by not reproducing sql directly, it is unlikely that it will need to be implemented at the adapter level. This is one of the benefits of creating macros such as `create_template(relation)`.
Proposed structure:
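One possible shape, consistent with the observations below (file and directory names are illustrative):

```
macros/
├── materializations/        # composite operations, written against *_template macros
│   └── ...
└── relations/
    ├── create.sql           # create_template + default__create_template (same file)
    ├── drop.sql             # drop_template + default__drop_template
    ├── replace.sql          # composite: create + drop + rename, relation-agnostic
    ├── table/
    │   └── create.sql       # create_table_template + default__create_table_template
    └── view/
        └── create.sql       # create_view_template + default__create_view_template
```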
Note the following observations:
- The macros in `/macros/relations/<relation_type>` will provide an interface and a default that raises an exception:
- Copying `dbt-postgres` sql into `dbt-redshift`? Yes. We will do it once, and if it ever deviates, we don't need to worry about whether we break `dbt-redshift` because we updated `dbt-postgres`, or vice versa.
- `dbt-core` can be extended to support non-built-in functionality (e.g. adding a materialization that's not a built-in, hence not in the dispatch flow of the relation-agnostic macro)
Takeaways:
- Use a `_template` suffix (e.g. `create_template(relation)`) on macros to indicate that the macro produces code that can be executed, but is not yet executed
- Avoid putting a `call` in a macro unless the macro is never intended to be used by another macro (e.g. a materialization)
- Keep a dispatch macro and its default together (e.g. `create_template` and `default__create_template`); these should always be in the same file since they are effectively the same macro
- An adapter generally only needs to implement the relation-specific macros (e.g. `create_table_template`); dispatch happens in `create_template(relation)`, the rest is boilerplate
- Instead of having `create` take a bunch of kwargs, make it take a relation
- `indexes` is in the config because it's `dbt-postgres`-specific; it should be on the `Relation` instance
Organize Adapter-land in `dbt-core`
First of all, by "Adapter-land" I am referring to the portion of `dbt-core` which could arguably be a standalone application, and which specifies what is needed from an adapter for `dbt-core` to function. In effect, it's a collection of components that articulate specific tasks against the database. We should carve them out as separate things and encapsulate them to the extent possible. Ultimately, `BaseAdapter` is then just an entrypoint that combines each of these modules to provide a service layer.
Create encapsulated components with limited entrypoints
An adapter needs access to different groupings of functionality, or components. One component may need access to another component's objects in order to do something (e.g. a `Materialization` needs a `Relation`); however, it should not need to know how to do something with that other object (e.g. how to create a backup relation). These components should use factories to produce instances and service layers to perform actions. These entrypoints should be enforced with common python practices, such as leading underscores to indicate private modules. This allows for more flexibility when updating code and provides some guidance when writing tests (tests should be written against the service layer, not the private implementation layer).
Given that each component has its own sandbox, functionality should be split similarly across modules. A component should likely be more than a single class; if it's a single class, either that class is too large and should be broken up, or the component really isn't a component. A module could have more than a single class. It may make sense to put `TableRelation` and `TableRelationChangeset` in the same file; but it doesn't make sense to put `IndexRelation` in that file just because tables have indexes. Similarly, a factory should exist separately from its models, since there is a clear hierarchy there.
Proposed structure (and application components):
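A hedged sketch of such a layout (module and class names are illustrative):

```
adapters/
├── base.py                  # BaseAdapter: entrypoint/service layer only
├── relations/
│   ├── factory.py           # RelationFactory: the component's public entrypoint
│   ├── _table.py            # TableRelation, TableRelationChangeset (private module)
│   └── _index.py            # IndexRelation lives in its own private module
└── materializations/
    ├── factory.py           # MaterializationFactory
    └── _table.py            # needs a Relation, but not its internals
```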
Takeaways: