Adapter Relation/Materialization on Rails #8177
Replies: 2 comments 1 reply
Quick thoughts:
I'm not going to lie - a lot of this went over my head. If I boil this down to an ELI5: it sounds like we are aiming to decouple what is a relation vs. a materialization, which we use a bit interchangeably right now. To create a new materialization, the main change to the user experience (and the partner adapter creation experience) is that they would have to write two macros: one that calls the relation and one that contains the create statement. They could also reuse the templates created in python? Overall, from what I can gather, this feels quite reasonable for scaling out, but it would be a major breaking change for both users (we have a lot of custom materializations being used in production today) and vendor-supported adapters (let alone the community ones). It feels like an item for a dbt 2.0 list.
Problem Statement
The Adapters team is looking at updating the way relations and materializations are managed within the adapter repos and `dbt-core` itself. This will require a more mature data model than we presently offer. The goal is to make an adapter easier to maintain and its functionality easier to extend. Our current approach is two-fold. We have a `BaseRelation` class that gets subclassed once (usually) per adapter. And we have materializations whose logic resides entirely in a jinja template. This was sufficient for our initial materializations; they were generally one-to-one with their relations. Examples include `table`, `view`, and `seed` (for the most part). However, we have added more complicated materializations since then, and re-used many existing relations. This organic growth led to quick delivery, but grew cumbersome to extend further.
Solution
Relation vs. Materialization
We will start by drawing a hard separation between the concept of a `Relation` and the concept of a `Materialization`. These two concepts are often conflated, so I will provide a definition for use within the context of this discussion:
Relation
A `Relation` is a database object that could present data in some way. Examples include `table`, `view`, `materialized_view`, and `dynamic_table`. The concept of a relation is very much on the "what" side of things. A relation does not care how its data is created, updated, etc.; it only cares about how its structure is created and updated. As an example, a `table` is a `table` whether it's always updated via `drop`/`create` (`table` materialization) or whether it's loaded with updates only (`incremental` materialization).
Materialization
A `Materialization` is a strategy that is applied to one or more `Relation` objects with the objective of updating the data in those relations. A `table` materialization is the strategy of using `drop`/`create` to update data in a `table` relation. A `seed` materialization is the strategy of using a csv to upload data into a `table` relation. Materializations often need multiple relations in order to be executed (e.g. target, intermediate, backup). And it's conceivable that there is a materialization where "the" target relation is actually multiple relations. A home-grown implementation of materialized views using a table, a view, and a stored procedure is such an example.
Jinja vs. Python
There is no concept of a `Materialization` within `dbt-core` other than what is found in the jinja materializations. There are similar concepts, like `CompiledNode` or `ParsedNode`, but these serve many purposes and are much bigger than the concept of a materialization. Inevitably we parse these objects, which are available in the global jinja context, right in the jinja template. This puts all of the orchestration logic in jinja as well, which makes it very difficult to test. Jinja is very good at templating; Python is very good at articulating logic flow. We should use these tools for what they are meant for.
Materialization vs. CompiledNode
In general, a `ParsedNode` or a `CompiledNode` contains all of the information needed to produce a `Materialization`, but in a format that can be difficult to parse. We need to serve multiple implementations of various types of relations across many database platforms; a single class becomes unwieldy to maintain. A node should be allowed to concern itself with where it sits in the graph and how it interacts with other nodes without the burden of also determining how to implement its materialization and update its relations (and vice versa).
Implementation
Use `BaseAdapter` as a service layer
`BaseAdapter` has gotten very wide; it has a lot of methods on it that are just dispatches to components. `BaseAdapter` should really be a service layer for the jinja context. Something in the template asks `BaseAdapter` for an object, and `BaseAdapter` goes and gets it by combining its factories. That object should then have all of the data it needs to do its job. The idea is very basic, but the point is that the logic lives in python, and the results of that logic are available as properties that are easily accessed in jinja.
Distinguish between `Relation` and `RelationComponent`
A `Relation` corresponds roughly to a dbt model and represents a way to access a single dataset in a database. That dataset could be physical, e.g. a `table`, or virtual, e.g. a `view`. A `Relation` has a single "primary" database object that itself contains a collection of smaller database objects. For example, a `table` in Postgres has the following components: schema, database, columns, indexes, grants. Without the original table, these components have nothing to be tied to. This latter set of objects will be referred to as `RelationComponent`. We are effectively making `Relation` the top-level database object; everything else becomes a `RelationComponent`. We are taking a bit of liberty with components like `schema` and `database`: we are saying these are attributes of `Relation`, even though they feel like higher-level collections of `Relation` objects. This reflects the fact that we are focusing at the `Relation` level for all implementation, even if we only need to alter one component. A reflection of this sentiment appears in the dbt enum `RelationType`. The values in this enum all represent `Relation` objects and not `RelationComponent` objects; `table` and `view` are included, `schema` and `index` are not.
Focusing at the `Relation` level requires some translation between the database vernacular and the dbt vernacular. We can use Postgres indexes as an example. Once a table is created in Postgres, one can create an index independently by indicating the table to which it belongs. In that sense, the index is independent and is the focus; hence it's a `create` statement on an `index` in Postgres vernacular. However, if we shift to a `Relation`-focused mindset, this is really altering the table to add an index; hence it's an `alter` statement on a `table` in dbt vernacular.
Takeaways:
- `table`, `view`, `materialized_view`, and `cte` are all examples of a `Relation`
- `schema`, `database`, `index`, and `grant` are all examples of a `RelationComponent`
- `Relation` instances have a `RelationType`
Organize Jinja Footprint
Ultimately, Jinja does not care how macros are organized in files and directories; but humans do care. We should organize macros by their use and by whether or not there is an expectation for adapter maintainers to overwrite them. We boast about having a very configurable tool, but we don't always have an easy way to configure one piece of it. "Do you want to configure how to drop your relation? Overwrite `drop_relation`. Do you want to add one new relation that doesn't quite fit the existing `drop_relation` implementation? Still overwrite `drop_relation`. Do you need to drop a relation as part of your materialization? You can't use `drop_relation`, so you'll likely just write the drop statement explicitly." This should be a little more flexible, with more entrypoints.
Distinguish between macros that template and macros that execute
In many dbt macros, we articulate how to do something, say `drop table if exists my_table`, and then also do that thing by wrapping the statement in a `call` block. The problem is that the code then isn't reusable in many situations. For example, when building the sql to execute a materialization, it's often necessary to drop something: either an existing relation or a backup. However, since `drop_relation` has a `call` in it, and the materialization will also need to be `call`ed, `drop_relation` needs to be implemented twice. Instead, we should create sql in templates that can be reused, and leave the `call` statement as the last step. In addition to this distinction, it's worth noting that dbt now templates both sql and python due to python models, so the `_sql` suffix is no longer appropriate. An alternative is to create a suffix per language, which would make `_sql` relevant again; however, that would require some level of indirection to know that `_sql` means the macro doesn't actually do the thing, it just provides how to do the thing. `_template` is much more straightforward.
Apply a tiered structure for relation macros
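A minimal jinja sketch of the template-vs-execute split described above (macro names are illustrative; `statement` is dbt's built-in call block):

```jinja
{#- The template macro only renders sql; it executes nothing. -#}
{% macro drop_relation_template(relation) %}
    drop {{ relation.type }} if exists {{ relation }}
{% endmacro %}

{#- The executing macro is a thin wrapper; the `call` happens last. -#}
{% macro drop_relation(relation) %}
    {% call statement('drop_relation') %}
        {{ drop_relation_template(relation) }}
    {% endcall %}
{% endmacro %}
```

A materialization that needs to drop a backup can now reuse `drop_relation_template` inside its own `call`, instead of re-implementing the drop statement.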
The jinja macro structure should be designed to be extensible, not replaceable. If the only entrypoint is the top level (i.e. the database function is a single macro), then the entire database function needs to be overwritten. This means that the overwritten macro also needs to be maintained in parallel with the default in `dbt-core`. If the macro is instead broken up into a few pieces, it's possible to extend that database function without overwriting the pieces that are already implemented in `dbt-core`. Let's use `create` as an example:
`create_template(relation)`
This macro can accept any relation and should work in any adapter; said another way, it's both relation-agnostic and adapter-agnostic. Barring edge cases and specific circumstances, this should be the macro that's called to create a relation, even if the relation type is known. Its primary function is to dispatch to the relation-specific, adapter-agnostic macro. There is no sql in here; it's mostly an if/else block, along with anything that we want to do in all `create` scenarios (e.g. logging).
This macro has a default implementation: `default__create_template(relation)`. The only reason to overwrite it is if the adapter supports more relation types than `dbt-core` does (e.g. Snowflake implements dynamic tables). In that case, it should be overwritten to check for that specific case, and otherwise call `default__create_template(relation)`, much like an abstract method. This represents the first entrypoint: `dbt-core` can be extended to add a relation type that is adapter-specific while still using the existing `dbt-core` workflows.
`create_view_template(relation)`
This macro is relation-aware, or relation-specific, but it's still adapter-agnostic. Its primary function is to be overridden by the adapter, or to throw an exception in the event that the adapter has not implemented this relation type (in this case `view`, so unlikely).
This macro's default implementation, `default__create_view_template(relation)`, should only raise an exception that communicates that `create` has not been implemented for `view` relations on this adapter. It's like raising `NotImplementedError` for an abstract method, but for jinja. The adapter is not forced to implement it, but if it calls it, this will remind the maintainer to either implement it or address the macro that called it.
The adapter-specific implementation that overrides this macro should contain the sql that creates the relation, in this case a `view`. There should be no consideration of materializations or use-case-specific implementations; that should happen in the python component. Instead, this macro should look very similar to the syntax provided in the database's docs. As such, it should not need much maintenance, unless the database adds new features to this relation or `dbt-core`/`dbt-<adapter>` decides to implement more options on this relation.
`create_dynamic_table_template(relation)` (an adapter-specific relation)
`dbt-core` can now be extended to support an adapter-specific relation without overwriting the existing relations. The maintainer would need to provide two macros within the adapter. Let's use a dynamic table as an example; it's a new relation in Snowflake. Here's how `create` can be implemented for a dynamic table:
Write composite operations in terms of relation-agnostic macros
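Picking up the dynamic table case from the preceding section, the two adapter-side macros might look roughly like this (the config attributes `target_lag`, `warehouse`, and `query` are illustrative, not an established interface):

```jinja
{#- Adapter-specific entrypoint: handles the new type, then defers. -#}
{% macro snowflake__create_template(relation) %}
    {% if relation.type == 'dynamic_table' %}
        {{ create_dynamic_table_template(relation) }}
    {% else %}
        {{ default__create_template(relation) }}
    {% endif %}
{% endmacro %}

{#- Adapter-specific relation macro; mirrors Snowflake's documented syntax. -#}
{% macro create_dynamic_table_template(relation) %}
    create dynamic table {{ relation }}
        target_lag = '{{ relation.target_lag }}'
        warehouse = {{ relation.warehouse }}
        as {{ relation.query }}
{% endmacro %}
```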
There are several operations that dbt performs which are composites of several database operations. The most obvious example is a materialization, which combines `create`, `drop`, `rename`, `alter`, etc. in order to execute. However, there are simpler, but still commonly used, operations that are composites of these atomic database operations. A very common example is `replace`. There is no general `replace` database operation; it's a combination of `create`, `drop`, and potentially `rename` for staging and backing up. This workflow can be described independent of relation type. If the macro is written at that relation-agnostic level, in particular by not reproducing sql directly, it is unlikely that it will need to be implemented at the adapter level. This is one of the benefits of creating macros such as `create_template(relation)`.
Proposed structure:
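One possible shape, consistent with the observations below (file and directory names are illustrative):

```
macros/
├── materializations/        # composite operations, written against *_template macros
│   └── ...
└── relations/
    ├── create.sql           # create_template + default__create_template (same file)
    ├── drop.sql             # drop_template + default__drop_template
    ├── replace.sql          # composite: create + drop + rename, relation-agnostic
    ├── table/
    │   └── create.sql       # create_table_template + default__create_table_template
    └── view/
        └── create.sql       # create_view_template + default__create_view_template
```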
Note the following observations:
- The macros in `/macros/relations/<relation_type>` will provide an interface and a default that raises an exception:
- Copying `dbt-postgres` sql into `dbt-redshift`? Yes. We will do it once, and if it ever deviates, we don't need to worry about whether we break `dbt-redshift` because we updated `dbt-postgres`, or vice versa.
- `dbt-core` can be extended to support non-built-in functionality (e.g. adding a materialization that's not a built-in, hence not in the dispatch flow of the relation-agnostic macro)
Takeaways:
- Use a `_template` suffix (e.g. `create_template(relation)`) on macros to indicate that the macro produces code that can be executed, but is not yet executed
- Avoid putting a `call` in a macro unless the macro is never intended to be used by another macro (e.g. a materialization)
- Keep a dispatch macro and its default together (e.g. `create_template` and `default__create_template`); these should always be in the same file since they are effectively the same macro
- An adapter generally only needs to implement the relation-specific macros (e.g. `create_table_template`); dispatch happens in `create_template(relation)`, the rest is boilerplate
- Instead of having `create` take a bunch of kwargs, make it take a relation
- `indexes` is in the config because it's `dbt-postgres`-specific; it should be on the `Relation` instance
Organize Adapter-land in `dbt-core`
First of all, by "Adapter-land" I am referring to the portion of `dbt-core` which could arguably be a standalone application, and which specifies what is needed from an adapter for `dbt-core` to function. In effect, it's a collection of components that articulate specific tasks against the database. We should carve them out as separate things and encapsulate them to the extent possible. Ultimately, `BaseAdapter` is then just an entrypoint that combines each of these modules to provide a service layer.
Create encapsulated components with limited entrypoints
An adapter needs access to different groupings of functionality, or components. One component may need access to another component's objects in order to do something (e.g. a `Materialization` needs a `Relation`); however, it should not need to know how to do something with that other object (e.g. how to create a backup relation). These components should use factories to produce instances and service layers to perform actions. These entrypoints should be enforced with common python practices, such as leading underscores to indicate private modules. This allows for more flexibility when updating code and provides some guidance when writing tests (tests should be written against the service layer, not the private implementation layer).
Given that each component has its own sandbox, functionality should be split similarly across modules. A component should likely be more than a single class; if it's a single class, either that class is too large and should be broken up, or the component really isn't a component. A module could have more than a single class. It may make sense to put `TableRelation` and `TableRelationChangeset` in the same file; but it doesn't make sense to put `IndexRelation` in that file just because tables have indexes. Similarly, a factory should exist separately from its models, since there is a clear hierarchy there.
Proposed structure (and application components):
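A hedged sketch of such a layout (module and class names are illustrative):

```
adapters/
├── base.py                  # BaseAdapter: entrypoint/service layer only
├── relations/
│   ├── factory.py           # RelationFactory: the component's public entrypoint
│   ├── _table.py            # TableRelation, TableRelationChangeset (private module)
│   └── _index.py            # IndexRelation lives in its own private module
└── materializations/
    ├── factory.py           # MaterializationFactory
    └── _table.py            # needs a Relation, but not its internals
```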
Takeaways: