-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[feat] Support ZORDER as a model config (#292) #297
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good. I have some newbie questions. Nothing that blocks merge.
{% macro optimize(relation) %} | ||
{{ return(adapter.dispatch('optimize', 'dbt')(relation)) }} | ||
{% endmacro %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At first I wondered why these lines are here. It seems to bring the dbt.optimize
macro into the current namespace so you can reference it on line 131?
For my own understanding: Jinja supports importing macros similar to Python ({% import "../path/to/file" as <importable-name> %}
), but it looks like dbt doesn't use that syntax. Instead we "import" functions by making local macros that call adapter.dispatch
. Does that sound correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is some dbt magic to allow for inheritance and user overwrites.
{% endmacro %} | ||
|
||
{% macro databricks__optimize(relation) %} | ||
{% if config.get('zorder', False) and config.get('file_format', 'delta') == 'delta' %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again I'm unfamiliar with how much of Jinja2 dbt uses, but customarily the 'delta'
string literal here would be a defined constant that's available to Jinja's context, so we can change it down the line. If that's not possible then this line is good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delta is specific to our adapter. Not something that can come from jinja. The context is created by dbt-core.
|
||
{% macro databricks__optimize(relation) %} | ||
{% if config.get('zorder', False) and config.get('file_format', 'delta') == 'delta' %} | ||
{% if var('DATABRICKS_SKIP_OPTIMIZE', 'false')|lower != 'true' and var('databricks_skip_optimize', 'false')|lower != 'true' %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For readability can we check == 'false'
rather than != 'true'
same as we do on line 119?
Also, why are there two configurations (uppercase and lowercase) for the same thing? Will this confuse users?
{% if var('DATABRICKS_SKIP_OPTIMIZE', 'false')|lower != 'true' and var('databricks_skip_optimize', 'false')|lower != 'true' %} | |
{% if var('DATABRICKS_SKIP_OPTIMIZE', 'false')|lower == 'false' and var('databricks_skip_optimize', 'false')|lower == 'false' %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not really. The default is to not set anything. So we need to check if the string is 'true', which means that the user actually wants to skip. If this was a boolean check it would be fine, but as a string I don't think it helps.
{% if zorder is sequence and zorder is not string %} | ||
zorder by ( | ||
{%- for col in zorder -%} | ||
{{ col }}{% if not loop.last %}, {% endif %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As janky as this looks it's very common when writing Jinja FYI
{%- set zorder = config.get('zorder', none) -%} | ||
optimize {{ relation }} | ||
{# TODO: predicates here? WHERE ... #} | ||
{% if zorder is sequence and zorder is not string %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would propose moving the check for zorder is string
to line 139. zorder is sequence
will not return True if zorder
is a string. In the case where zorder is not sequence
and is also not a string, Jinja will try to iterate over it and it will fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand. zorder should only be sequence or string. This same sort of jinja code is all over dbt and its macros and it works.
@@ -59,6 +59,7 @@ class DatabricksConfig(AdapterConfig): | |||
options: Optional[Dict[str, str]] = None | |||
merge_update_columns: Optional[str] = None | |||
tblproperties: Optional[Dict[str, str]] = None | |||
zorder: Optional[Union[List[str], str]] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this configuration per-relation? Or per project?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can define dbt configs on a per project, per folder, or per model basis
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to rebase off of main
in order to pick up the full matrix of integration tests.
Resolves #122 ### Description This PR augments the dbt model config with: config( materialized='incremental', zorder="column_A" | ["column_A", "column_B"] ) Under the hood, after every model with zorder is build (initial or incremental), we'll run a OPTIMIZE relation ZORDER BY () statement. A var is available to skip OPTIMIZE if necessary: databricks_skip_optimize One can set it in the dbt-project.yaml or directly in the comman line: dbt run --models stg_payments --vars "{'databricks_skip_optimize': 'true'}" #### Benefits: This simplifies the project as it requires one less post hook for maintenance operations. #### Potential improvements Add predicates to OPTIMIZE statement. OPTIMIZE table WHERE ... Do not fail model if optimize fails: warn if optimize fails, but not error out ### Checklist I have run this code in development and it appears to resolve the stated issue This PR includes tests, or tests are not required/relevant for this PR I have updated the CHANGELOG.md and added information about my change to the "dbt-databricks next" section.
7ef2f2e
to
850af5e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Resolves #122 ### Description This PR augments the dbt model config with: config( materialized='incremental', zorder="column_A" | ["column_A", "column_B"] ) Under the hood, after every model with zorder is build (initial or incremental), we'll run a OPTIMIZE relation ZORDER BY () statement. A var is available to skip OPTIMIZE if necessary: databricks_skip_optimize One can set it in the dbt-project.yaml or directly in the command line: dbt run --models stg_payments --vars "{'databricks_skip_optimize': 'true'}" #### Benefits: This simplifies the project as it requires one less post hook for maintenance operations. #### Potential improvements Add predicates to OPTIMIZE statement. OPTIMIZE table WHERE ... Do not fail model if optimize fails: warn if optimize fails, but not error out
Resolves #122
Description
This PR augments the dbt model config with:
Under the hood, after every model with zorder is build (initial or incremental), we'll run a OPTIMIZE relation ZORDER BY () statement.
A var is available to skip OPTIMIZE if necessary:
databricks_skip_optimize
One can set it in the dbt-project.yaml or directly in the comman line:
dbt run --models stg_payments --vars "{'databricks_skip_optimize': 'true'}"
Benefits:
This simplifies the project as it requires one less post hook for maintenance operations.
Potential improvements
Checklist
CHANGELOG.md
and added information about my change to the "dbt-databricks next" section.