Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feat] Support ZORDER as a model config (#292) #297

Merged
merged 1 commit into from
Mar 17, 2023
Merged

Conversation

andrefurlan-db
Copy link
Collaborator

@andrefurlan-db andrefurlan-db commented Mar 16, 2023

Resolves #122

Description

This PR augments the dbt model config with:

config(
    materialized='incremental',
    zorder="column_A" | ["column_A", "column_B"]
)

Under the hood, after every model with zorder is build (initial or incremental), we'll run a OPTIMIZE relation ZORDER BY () statement.

A var is available to skip OPTIMIZE if necessary: databricks_skip_optimize

One can set it in the dbt-project.yaml or directly in the comman line: dbt run --models stg_payments --vars "{'databricks_skip_optimize': 'true'}"

Benefits:

This simplifies the project as it requires one less post hook for maintenance operations.

Potential improvements

  • Add predicates to OPTIMIZE statement. OPTIMIZE table WHERE ...
  • Do not fail model if optimize fails: warn if optimize fails, but not error out

Checklist

  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt-databricks next" section.

@andrefurlan-db andrefurlan-db temporarily deployed to azure-prod-peco March 16, 2023 17:33 — with GitHub Actions Inactive
Copy link

@susodapop susodapop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good. I have some newbie questions. Nothing that blocks merge.

Comment on lines +114 to +116
{% macro optimize(relation) %}
{{ return(adapter.dispatch('optimize', 'dbt')(relation)) }}
{% endmacro %}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first I wondered why these lines are here. It seems to bring the dbt.optimize macro into the current namespace so you can reference it on line 131?

For my own understanding: Jinja supports importing macros similar to Python ({% import "../path/to/file" as <importable-name> %}), but it looks like dbt doesn't use that syntax. Instead we "import" functions by making local macros that call adapter.dispatch. Does that sound correct?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is some dbt magic to allow for inheritance and user overwrites.

{% endmacro %}

{% macro databricks__optimize(relation) %}
{% if config.get('zorder', False) and config.get('file_format', 'delta') == 'delta' %}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again I'm unfamiliar with how much of Jinja2 dbt uses, but customarily the 'delta' string literal here would be a defined constant that's available to Jinja's context, so we can change it down the line. If that's not possible then this line is good.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delta is specific to our adapter. Not something that can come from jinja. The context is created by dbt-core.


{% macro databricks__optimize(relation) %}
{% if config.get('zorder', False) and config.get('file_format', 'delta') == 'delta' %}
{% if var('DATABRICKS_SKIP_OPTIMIZE', 'false')|lower != 'true' and var('databricks_skip_optimize', 'false')|lower != 'true' %}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For readability can we check == 'false' rather than != 'true' same as we do on line 119?

Also, why are there two configurations (uppercase and lowercase) for the same thing? Will this confuse users?

Suggested change
{% if var('DATABRICKS_SKIP_OPTIMIZE', 'false')|lower != 'true' and var('databricks_skip_optimize', 'false')|lower != 'true' %}
{% if var('DATABRICKS_SKIP_OPTIMIZE', 'false')|lower == 'false' and var('databricks_skip_optimize', 'false')|lower == 'false' %}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really. The default is to not set anything. So we need to check if the string is 'true', which means that the user actually wants to skip. If this was a boolean check it would be fine, but as a string I don't think it helps.

{% if zorder is sequence and zorder is not string %}
zorder by (
{%- for col in zorder -%}
{{ col }}{% if not loop.last %}, {% endif %}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As janky as this looks it's very common when writing Jinja FYI

{%- set zorder = config.get('zorder', none) -%}
optimize {{ relation }}
{# TODO: predicates here? WHERE ... #}
{% if zorder is sequence and zorder is not string %}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would propose moving the check for zorder is string to line 139. zorder is sequence will not return True if zorder is a string. In the case where zorder is not sequence and is also not a string, Jinja will try to iterate over it and it will fail.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand. zorder should only be sequence or string. This same sort of jinja code is all over dbt and its macros and it works.

tests/unit/test_macros.py Show resolved Hide resolved
@@ -59,6 +59,7 @@ class DatabricksConfig(AdapterConfig):
options: Optional[Dict[str, str]] = None
merge_update_columns: Optional[str] = None
tblproperties: Optional[Dict[str, str]] = None
zorder: Optional[Union[List[str], str]] = None

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this configuration per-relation? Or per project?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can define dbt configs on a per project, per folder, or per model basis

Copy link

@susodapop susodapop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to rebase off of main in order to pick up the full matrix of integration tests.

Resolves #122

### Description
This PR augments the dbt model config with:

config(
    materialized='incremental',
    zorder="column_A" | ["column_A", "column_B"]
)
Under the hood, after every model with zorder is build (initial or incremental), we'll run a OPTIMIZE relation ZORDER BY () statement.

A var is available to skip OPTIMIZE if necessary: databricks_skip_optimize

One can set it in the dbt-project.yaml or directly in the comman line: dbt run --models stg_payments --vars "{'databricks_skip_optimize': 'true'}"

#### Benefits:
This simplifies the project as it requires one less post hook for maintenance operations.

#### Potential improvements
Add predicates to OPTIMIZE statement. OPTIMIZE table WHERE ...
Do not fail model if optimize fails: warn if optimize fails, but not error out

### Checklist
 I have run this code in development and it appears to resolve the stated issue
 This PR includes tests, or tests are not required/relevant for this PR
 I have updated the CHANGELOG.md and added information about my change to the "dbt-databricks next" section.
@andrefurlan-db andrefurlan-db temporarily deployed to azure-prod-pecou March 16, 2023 19:11 — with GitHub Actions Inactive
@andrefurlan-db andrefurlan-db temporarily deployed to azure-prod-peco March 16, 2023 19:11 — with GitHub Actions Inactive
Copy link

@susodapop susodapop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@andrefurlan-db andrefurlan-db merged commit 2dacef0 into main Mar 17, 2023
@andrefurlan-db andrefurlan-db deleted the staging-292 branch March 17, 2023 22:49
andrefurlan-db added a commit that referenced this pull request Apr 19, 2023
Resolves #122

### Description
This PR augments the dbt model config with:

config(
    materialized='incremental',
    zorder="column_A" | ["column_A", "column_B"]
)
Under the hood, after every model with zorder is build (initial or incremental), we'll run a OPTIMIZE relation ZORDER BY () statement.

A var is available to skip OPTIMIZE if necessary: databricks_skip_optimize

One can set it in the dbt-project.yaml or directly in the command line: dbt run --models stg_payments --vars "{'databricks_skip_optimize': 'true'}"

#### Benefits:
This simplifies the project as it requires one less post hook for maintenance operations.

#### Potential improvements
Add predicates to OPTIMIZE statement. OPTIMIZE table WHERE ...
Do not fail model if optimize fails: warn if optimize fails, but not error out
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support ZORDER on models as model configuration
2 participants