Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Methods to achieve null safety for deduplicate #815

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

dbeatty10
Copy link
Contributor

@dbeatty10 dbeatty10 commented Jul 26, 2023

resolves #814
resolves #621

This is a bug fix with no breaking changes.

It also adds two new features:

  • optional row_alias keyword argument (type: string, default: none)
  • optional columns keyword argument (type: list, default: none)

Description & motivation

This PR is still in draft status, and more description will be added at a later date.

In the meantime, see #814 (and everything it links to, in particular #713) for background motivation and discussion to-date.

As a summary, this PR gives the user multiple options to achieve null safety for deduplicate:

  1. The user passes a relation that has been materialized in the database (i.e., it isn't an ephemeral model)
    • fe03f43 -- as long as the relation is not a CTE, it's columns can be fetched via the get_filtered_columns_in_relation macro
  2. The user passes the row_alias keyword argument
    • 3eced4d -- when the row_alias keyword argument is set, then we can deduplicate via the row_number() window function (at the cost of the row_alias being an extra column that wasn't in the original data set)
  3. The user passes the columns keyword argument
    • d46676e -- when columns keyword argument is set, then we can deduplicate via the row_number() window function and only return the requested columns

Outside of those options, the deduplication will not be null-safe.

Option 1

models/my_model_1.sql

{{ config(materialized="table") }}

select 1 as user_id, cast(null as date) as created_at, 1 as version union all
select 1 as user_id, cast(null as date) as created_at, 2 as version union all
select 1 as user_id, cast(null as date) as created_at, 2 as version

models/deduped_1.sql

    {{
        dbt_utils.deduplicate(
            ref('my_model_1'),
            partition_by='user_id, created_at',
            order_by='version desc'
        ) | indent
    }}
dbt build -s +deduped_1
dbt show -s deduped_1
user_id created_at version
1 2

Option 2

models/my_model_2.sql

{{ config(materialized="ephemeral") }}

select 1 as user_id, cast(null as date) as created_at, 1 as version union all
select 1 as user_id, cast(null as date) as created_at, 2 as version union all
select 1 as user_id, cast(null as date) as created_at, 2 as version

models/deduped_2.sql

    {{
        dbt_utils.deduplicate(
            ref('my_model_2'),
            partition_by='user_id, created_at',
            order_by='version desc',
            row_alias='rn'
        ) | indent
    }}
dbt build -s +deduped_2
dbt show -s deduped_2
user_id created_at version rn
1 2 1

Option 3

models/my_model_3.sql

{{ config(materialized="ephemeral") }}

select 1 as user_id, cast(null as date) as created_at, 1 as version union all
select 1 as user_id, cast(null as date) as created_at, 2 as version union all
select 1 as user_id, cast(null as date) as created_at, 2 as version

models/deduped_3.sql

    {{
        dbt_utils.deduplicate(
            ref('my_model_3'),
            partition_by='user_id, created_at',
            order_by='version desc',
            columns=['user_id', 'created_at', 'version']
        ) | indent
    }}
dbt build -s +deduped_3
dbt show -s deduped_3
user_id created_at version
1 2

Option 4

models/my_model_4.sql

{{ config(materialized="ephemeral") }}

select 1 as user_id, cast(null as date) as created_at, 1 as version union all
select 1 as user_id, cast(null as date) as created_at, 2 as version union all
select 1 as user_id, cast(null as date) as created_at, 2 as version

Warning

This is the one not guaranteed to be null-safe (depending on the adapter).

models/deduped_4.sql

    {{
        dbt_utils.deduplicate(
            ref('my_model_4'),
            partition_by='user_id, created_at',
            order_by='version desc',
        ) | indent
    }}
dbt build -s +deduped_4
dbt show -s deduped_4

Here's the warning that will be logged:

Warning: the implementation of the `deduplicate` macro for the `postgres` adapter is not null safe. 
Set `row_alias` within calls to `deduplicate` to achieve null safety (which will also add it as an extra column to the output).

e.g.,
    {
        dbt_utils.deduplicate(
            'my_cte',
            partition_by='user_id',
            order_by='version desc',
            row_alias='rn'
        ) | indent
    }

Warning triggered by model: my_project.deduped_4
dbt project / package: my_project
path: models/deduped_4.sql
user_id created_at version

Key history of deduplicate macro

Checklist

  • I followed guidelines to ensure that my changes will work on "non-core" adapters
  • 👈 I have updated the README.md (if applicable)
  • 👈 I have added tests & descriptions to my models (and macros if applicable)
  • 👈 I have added an entry to CHANGELOG.md

from {{ relation }} as _inner
)

select *
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What databases allow for minus or except syntax? I know snowflake does - that could be an option for removing the extra column. Though maybe in that case you'd just use qualify

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would minus or except work to remove extra column(s)? Do you mean select * exclude ( <col_name>, <col_name>, ... )?

This would be the perfect solution if we could rely on it! 💡

But it is not in the SQL standard, and the databases that don't have qualify are probably missing select * exclude (...) as well. So I don't think we'll be able to reliably use it as part of the default implementation 😢.


select * exclude (...)

Snowflake has select * exclude:

image

And so does DuckDB:

image

select * except (...)

And because it's not in the standard, other databases use except instead of exclude.

BigQuery uses except:

image

As does Databricks:

image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, yes I meant exclude. What about using the star macro with the except argument?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The initial implementation in #512 used the star macro but it was removed in #548.

I haven't considered the details of how we might be able to bring it back or what those implications would be.

I think we'd still need to handle the case where the relation is a CTE name instead of a Relation. That's the case that this draft PR is covering with the row_alias parameter. An alternative way to cover it would be a columns parameter like suggested here. Allowing the end user to choose between either row_alias or columns would provide the most optionality.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@graciegoheen your idea about using the star macro inspired fe03f43.

It retrieves columns similarly to dbt_utils.star IFF:

  • relation is a Relation
  • relation is not an ephemeral CTE

Otherwise, a user can pass a list of columns manually (d46676e). Or they can specify a row_alias that is acceptable to them.

@@ -104,7 +104,10 @@ path: {}
{% set row_alias = kwargs.get('row_alias') %}
{% set columns = kwargs.get('columns') %}

{% if row_alias != None or columns != None %}
{% if relation.is_cte is defined and not relation.is_cte %}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a simplified alternative to this and this to determine if a Relation is backed by a real database object whose columns can be fetched via the star macro.

i.e., the columns for tables and views can be retrieved via information_schema.columns (or an equivalent), but can't for CTE.

@dbeatty10 dbeatty10 changed the title Null safety for deduplicate via row_alias keyword argument Methods to achieve null safety for deduplicate Aug 1, 2023
Copy link

This PR has been marked as Stale because it has been open with no activity as of late. If you would like the PR to remain open, please comment on the PR or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Jan 29, 2024
@dbeatty10 dbeatty10 removed the Stale label Jan 29, 2024
@dbeatty10 dbeatty10 added the bug Something isn't working label Apr 18, 2024
Copy link

This PR has been marked as Stale because it has been open with no activity as of late. If you would like the PR to remain open, please comment on the PR or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Nov 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

deduplicate bug in Spark in case of a null column. Default implementation of deduplicate is not null-safe
2 participants