[feat] Support ZORDER as a model config (#292) #297

Merged

andrefurlan-db merged 1 commit into main from staging-292

Mar 17, 2023

Collaborator

andrefurlan-db commented Mar 16, 2023

Resolves #122

Description

This PR augments the dbt model config with:

config(
    materialized='incremental',
    zorder="column_A" | ["column_A", "column_B"]
)

Under the hood, after every model with zorder is built (initial or incremental), we'll run an OPTIMIZE relation ZORDER BY () statement.

A var is available to skip OPTIMIZE if necessary: databricks_skip_optimize

One can set it in dbt_project.yml or directly on the command line: dbt run --models stg_payments --vars "{'databricks_skip_optimize': 'true'}"
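A rough Python sketch of the behaviour described above, not the adapter's actual code (the real logic lives in a Jinja macro). The function name `build_optimize_sql` and its parameters are hypothetical. It assumes a single skip flag passed as a string, and it leaves out the `file_format` check that the macro also performs.

```python
from typing import List, Optional, Union


def build_optimize_sql(
    relation: str,
    zorder: Optional[Union[str, List[str]]],
    skip_optimize: str = "false",
) -> Optional[str]:
    """Return the OPTIMIZE statement to run after a model build, or None."""
    # No zorder configured, or the user asked to skip: nothing to run.
    if not zorder or skip_optimize.lower() == "true":
        return None
    # Accept either a single column name or a list of column names.
    columns = [zorder] if isinstance(zorder, str) else list(zorder)
    return f"optimize {relation} zorder by ({', '.join(columns)})"


print(build_optimize_sql("analytics.stg_payments", ["column_A", "column_B"]))
```

Passing `skip_optimize="true"` (any casing) makes the helper return `None`, which mirrors setting the `databricks_skip_optimize` var.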

Benefits:

This simplifies the project, as it requires one fewer post-hook for maintenance operations.

Potential improvements

Add predicates to the OPTIMIZE statement: OPTIMIZE table WHERE ...
Do not fail the model if OPTIMIZE fails: emit a warning instead of erroring out.

Checklist

I have run this code in development and it appears to resolve the stated issue
This PR includes tests, or tests are not required/relevant for this PR
I have updated the CHANGELOG.md and added information about my change to the "dbt-databricks next" section.

andrefurlan-db requested review from susodapop and ueshin as code owners

March 16, 2023 17:03

andrefurlan-db temporarily deployed to azure-prod-peco

March 16, 2023 17:33

— with

GitHub Actions Inactive

susodapop approved these changes

susodapop left a comment

Overall looks good. I have some newbie questions. Nothing that blocks merge.

dbt/include/databricks/macros/adapters.sql

Comment on lines +114 to +116

+              {% macro optimize(relation) %}
+                {{ return(adapter.dispatch('optimize', 'dbt')(relation)) }}
+              {% endmacro %}

susodapop Mar 16, 2023

At first I wondered why these lines are here. It seems to bring the dbt.optimize macro into the current namespace so you can reference it on line 131?

For my own understanding: Jinja supports importing macros similar to Python ({% import "../path/to/file" as <importable-name> %}), but it looks like dbt doesn't use that syntax. Instead we "import" functions by making local macros that call adapter.dispatch. Does that sound correct?

Collaborator Author

andrefurlan-db Mar 16, 2023

this is some dbt magic to allow for inheritance and user overwrites.
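To make the "inheritance" idea concrete, here is a simplified Python sketch of prefix-based lookup in the spirit of `adapter.dispatch`. It is an illustration only: the `dispatch` function, the `macros` registry, and the `default__optimize` entry are hypothetical, and dbt-core's real resolution also takes packages and namespaces into account.

```python
from typing import Callable, Dict, Sequence

Macro = Callable[[str], str]


def dispatch(name: str, macros: Dict[str, Macro], adapters: Sequence[str]) -> Macro:
    """Pick the most specific implementation of `name`."""
    # Try each adapter prefix in order, then fall back to the default.
    for prefix in [*adapters, "default"]:
        candidate = macros.get(f"{prefix}__{name}")
        if candidate is not None:
            return candidate
    raise KeyError(f"no implementation found for macro '{name}'")


# Hypothetical registry: an adapter-specific macro plus a generic fallback.
macros: Dict[str, Macro] = {
    "default__optimize": lambda relation: "",
    "databricks__optimize": lambda relation: f"optimize {relation}",
}

optimize = dispatch("optimize", macros, ["databricks", "spark"])
print(optimize("analytics.stg_payments"))
```

Under this reading, the thin `optimize` wrapper in the diff is the single entry point, and whichever prefixed macro is found first supplies the behaviour.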

dbt/include/databricks/macros/adapters.sql

+              {% endmacro %}
+              {% macro databricks__optimize(relation) %}
+                {% if config.get('zorder', False) and config.get('file_format', 'delta') == 'delta' %}

susodapop Mar 16, 2023

Again I'm unfamiliar with how much of Jinja2 dbt uses, but customarily the 'delta' string literal here would be a defined constant that's available to Jinja's context, so we can change it down the line. If that's not possible then this line is good.

Collaborator Author

andrefurlan-db Mar 16, 2023

delta is specific to our adapter. Not something that can come from jinja. The context is created by dbt-core.

dbt/include/databricks/macros/adapters.sql

+              {% macro databricks__optimize(relation) %}
+                {% if config.get('zorder', False) and config.get('file_format', 'delta') == 'delta' %}
+                  {% if var('DATABRICKS_SKIP_OPTIMIZE', 'false')|lower != 'true' and var('databricks_skip_optimize', 'false')|lower != 'true' %}

susodapop Mar 16, 2023

For readability can we check == 'false' rather than != 'true' same as we do on line 119?

Also, why are there two configurations (uppercase and lowercase) for the same thing? Will this confuse users?

Suggested change

      
-                 {% if var('DATABRICKS_SKIP_OPTIMIZE', 'false')|lower != 'true' and var('databricks_skip_optimize', 'false')|lower != 'true' %}
+                 {% if var('DATABRICKS_SKIP_OPTIMIZE', 'false')|lower == 'false' and var('databricks_skip_optimize', 'false')|lower == 'false' %}

Collaborator Author

andrefurlan-db Mar 16, 2023

not really. The default is to not set anything. So we need to check if the string is 'true', which means that the user actually wants to skip. If this was a boolean check it would be fine, but as a string I don't think it helps.
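The two forms of the check agree when the var is unset, 'true', or 'false', but they differ for any other string. A small Python sketch of that difference, using hypothetical helper names and a single var for brevity:

```python
def runs_with_not_true(value: str = "false") -> bool:
    """Merged form: OPTIMIZE runs unless the var is exactly 'true'."""
    return value.lower() != "true"


def runs_with_equals_false(value: str = "false") -> bool:
    """Suggested form: OPTIMIZE runs only if the var is exactly 'false'."""
    return value.lower() == "false"


for value in ("false", "true", "True", "yes", ""):
    print(repr(value), runs_with_not_true(value), runs_with_equals_false(value))
```

With the `!= 'true'` form, an unexpected value such as 'yes' still runs OPTIMIZE; with `== 'false'` it would silently skip it.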

dbt/include/databricks/macros/adapters.sql

+                  {% if zorder is sequence and zorder is not string %}
+                    zorder by (
+                      {%- for col in zorder -%}
+                      {{ col }}{% if not loop.last %}, {% endif %}

susodapop Mar 16, 2023

As janky as this looks it's very common when writing Jinja FYI

dbt/include/databricks/macros/adapters.sql

+                   {%- set zorder = config.get('zorder', none) -%}
+                  optimize {{ relation }}
+                  {# TODO: predicates here? WHERE ...  #}
+                  {% if zorder is sequence and zorder is not string %}

susodapop Mar 16, 2023

I would propose moving the check for zorder is string to line 139. zorder is sequence will not return True if zorder is a string. In the case where zorder is not sequence and is also not a string, Jinja will try to iterate over it and it will fail.

Collaborator Author

andrefurlan-db Mar 17, 2023

I don't understand. zorder should only be sequence or string. This same sort of jinja code is all over dbt and its macros and it works.
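For context on the `is sequence and is not string` pairing: as far as I know, Jinja2's `sequence` test only checks that a value has a length and supports indexing, so a plain string passes it too. The Python sketch below approximates that test (`looks_like_sequence` is a stand-in, not Jinja's own function) and shows one way the two checks could combine. Treating every non-list value as a single column name is an assumption, since the else branch is not visible in the diff above.

```python
from typing import Any, List


def looks_like_sequence(value: Any) -> bool:
    """Approximation of Jinja2's `sequence` test: has a length and supports indexing."""
    try:
        len(value)
        value.__getitem__
    except Exception:
        return False
    return True


def zorder_columns(zorder: Any) -> List[Any]:
    # A non-string sequence is a list of columns; anything else is
    # treated as a single column name (assumed fallback).
    if looks_like_sequence(zorder) and not isinstance(zorder, str):
        return list(zorder)
    return [zorder]


print(looks_like_sequence("column_A"))          # a string counts as a sequence here
print(zorder_columns("column_A"))
print(zorder_columns(["column_A", "column_B"]))
```

Without the string exclusion, a value like "column_A" would be iterated character by character.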

dbt/adapters/databricks/impl.py

@@ -59,6 +59,7 @@ class DatabricksConfig(AdapterConfig):
                   options: Optional[Dict[str, str]] = None
                   merge_update_columns: Optional[str] = None
                   tblproperties: Optional[Dict[str, str]] = None
+                  zorder: Optional[Union[List[str], str]] = None

susodapop Mar 16, 2023

Is this configuration per-relation? Or per project?

Collaborator Author

andrefurlan-db Mar 17, 2023

you can define dbt configs on a per project, per folder, or per model basis

susodapop suggested changes

susodapop left a comment

You need to rebase off of main in order to pick up the full matrix of integration tests.


          [feat] Support ZORDER as a model config (#292)

850af5e

andrefurlan-db force-pushed the staging-292 branch from 7ef2f2e to 850af5e Compare

March 16, 2023 18:44

andrefurlan-db temporarily deployed to azure-prod-peco

March 16, 2023 19:11

— with

GitHub Actions Inactive

andrefurlan-db temporarily deployed to azure-prod-peco

March 16, 2023 19:11

— with

GitHub Actions Inactive

susodapop approved these changes

susodapop left a comment

LGTM

andrefurlan-db merged commit 2dacef0 into main

andrefurlan-db deleted the staging-292 branch

March 17, 2023 22:49

andrefurlan-db added a commit that referenced this pull request


          [feat] Support ZORDER as a model config (#292) (#297)

32b4242
